Motivation
Let’s start by doing some arithmetic about large language models (LLMs). These are neural networks with huge parameter counts: state-of-the-art open-weights models (i.e., ones you can download) sit at around 100B (\(10^{11}\)) parameters or so, and usable ones around one order of magnitude smaller.
Take the latest SOTA release Qwen 3 235B-A22B, for instance, which has roughly 235B parameters. If all these parameters were stored in a naive array of 32-bit (4-byte) floating point numbers, this model would require around 940 GB of storage, and roughly that much memory to run at a usable speed. Running this model purely on CPU with dual-channel DDR4 RAM (which is likely the kind of RAM you have on your computer) would take multiple seconds to output a single token/word (and even that is quite fast for a model of this total size, because the architecture is what is called a Mixture of Experts - more on that later, so don’t worry yet).
For usable inference (i.e., generating outputs from the model) at this scale, you need GPUs (or TPUs, but those live in the cloud), which means around a TB of VRAM (the GPU equivalent of RAM). Even considering the GPUs with the most VRAM per unit of currency, the GPUs alone will cost upwards of $25000 by an optimistic estimate, let alone power costs and the rest of the parts of the mini-cluster you’ll have to set up (and perhaps a soundproof room to avoid working next to a jet engine).
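To make the arithmetic above concrete, here is the back-of-the-envelope version (parameter count rounded, ignoring the KV cache and other runtime overheads):

```python
params = 235e9  # Qwen 3 235B-A22B, roughly

# Bytes per parameter at different precisions (4-bit shown as 0.5 bytes,
# ignoring the small per-block overhead that quantization formats add).
for name, bytes_per_param in [("FP32", 4), ("FP16/bf16", 2), ("FP8", 1), ("4-bit quant", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
# FP32: ~940 GB, FP16/bf16: ~470 GB, FP8: ~235 GB, 4-bit: ~118 GB
```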
Meanwhile, thanks to quantization, I’m running this LLM (albeit with some quality loss) on a setup with 48GB of VRAM and 128GB of RAM, at a very usable 7 tokens per second (around 5 words per second). Not a bad deal, huh? (You can in fact go faster, but I chose not to).
You can also choose to run much smaller models (logically coherent outputs start at ~0.5B parameters, but larger models are way better) given your hardware constraints - even CPU-only inference can be fast - but quantization improves inference performance regardless of model size.
Now that I’ve gotten your attention, let’s go over what quantization is, how it works at a high level, and what it buys you for inference. It’s quite a simple idea at heart, but I’ll take some time to explain the relevant context in the following sections.
A quick example of running a small LLM on your computer with/without a GPU
To use an LLM, we need two things - the LLM’s weights and some software that can use those weights. Since we only really care about getting things up and running on as broad a range of hardware as possible, we’ll use llama.cpp. To avoid as much setup as possible, I’d recommend llamafile (I personally use llama.cpp itself for its bleeding-edge features). To get it, download the llamafile executable from here (direct link). It will not be executable by default, so look at the Quickstart section in the README here for how to make it executable on your specific system - note that the default instructions are for when the weights and llamafile are packaged together, which is not what we will be doing. (If for some reason llamafile doesn’t work, you can go to the llama.cpp GitHub page, look at the latest release, and download the relevant executables for your system, or build them yourself using the build instructions.)
Once this is done, we can move to downloading LLM weights. If you have 3-4 GB of RAM available, you can try out Llama 3.2 3B by downloading it from here (click on “download”). If you have more RAM and are fine with the LLM running slower, you can try out Llama 3.1 8B by downloading it from here. Don’t expect factual correctness from these LLMs: they are quite small (3 billion and 8 billion parameters respectively), while the models you get served via ChatGPT/Claude/Gemini are far larger (hundreds of billions, if not trillions, of parameters) and also have search integrated these days.
You’ll now have two files - one being the executable (let’s call it a), and the other being the weights file with the extension gguf (let’s call it b). In the terminal, you should run ./a -m b (on Windows, you may need .\a -m b instead).
This will give you a prompt at your terminal where you can give inputs.
You might find this to be a bit slow, especially with the larger model, so if you have a GPU, you’re in luck. We can selectively offload layers of the LLM to the GPU using -ngl l, where l is the number of layers to offload. If you can fit the whole model into your GPU’s VRAM, l = 9999 should work. If not, you’ll have to find the largest l that still fits (you can binary search on this). When most layers are on the GPU, inference can be really fast. As an estimate, if you’re using the 3B LLM we downloaded above, a GPU with 4GB of VRAM should be able to fit the whole model; if you’re using the 8B LLM, a GPU with 8GB of VRAM should be able to fit it.
If you want a web interface, the URL mentioned in the command’s output before the prompt should work too. Check out the output of --help to see what else can be done, and check out llama.cpp if you want to be at the bleeding edge and/or build executables for platforms not supported by llamafile, like phones. Don’t forget to use your favourite search engine/LLM to tell you about things that don’t make sense.
llama.cpp can run a lot of LLMs (and not just the Llama series of models) - you can find them by searching for the word GGUF on Hugging Face (GGUFs by bartowski, unsloth and mradermacher are quite reliable), or go to the official model pages and look for the quantization node in the model tree on the right of the model card. Notice that both of the models we downloaded have an Instruct suffix - this means they were instruction-tuned for chat-like behaviour, while base models (which are also usually released) are more suitable for text completion than for talking to.
Note: If you want to find pre-quantized versions of LLMs (or other types of popular models for things like image/video/music gen), you can look for them on Hugging Face (look for the quantization node in the model tree on the right of the official model page that you can navigate to through search). There will usually be instructions on how to run them in the quantized models’ model cards, but if you get stuck, your favourite search engine/LLM is your friend. The most popular formats for LLMs are GGUF (llama.cpp), EXL2/EXL3 (ExLlamaV2/3) and AWQ. When in doubt, I’ve found that llama.cpp quants are quite reliable and support the broadest range of hardware (though to squeeze out performance, you may want to look at other formats too). Do remember to read the section on general notes, it should help too. On that note, I’ll also point out that most of the blog post below is an overview - the field is way too large to do justice to in a single blog post, so don’t be afraid of diving deeper on your own (and if possible, give back to the LLM hobbyist community by participating and contributing!).
More about inference and a tiny bit of quantization
Earlier, we talked about using 32-bit floats for each parameter of LLMs. Let’s pause to think - do we really need the whole 32 bits?
LLMs these days are trained via what is called mixed-precision training - floating point numbers in different parts of the network are kept in different precisions - FP32 (32 bits), FP16 (16 bits), bfloat16 (16 bits, but with a larger exponent and smaller mantissa) and various FP8 (8 bits) variants like E4M3 and E5M2. If models can be trained in these precisions, we can surely use them for inference in those precisions too. In fact, most open-weights LLM releases have the majority of their parameters in 16-bit precision (usually bfloat16), which already cuts our original estimate in half. And some models (like DeepSeek R1 671B) are trained mostly in FP8 precision, which halves it again.
Remember, VRAM is the most expensive part in your build, so the less you have to buy, the better. And in a hybrid build, the larger the part of work that is offloaded to the GPU, the better. (And FP8 is not supported on some of the cheaper/older GPUs).
So how do we minimize GPU costs? GPU costs also depend on factors like the amount of VRAM, the memory bandwidth (how fast you can stream memory) and the amount of compute (how many operations per second) you get from them. Here’s a fact: with common GPU configurations, it is the first two that matter if you are using LLMs in a standard manner - 1-2 inference sessions (like chat sessions) without unreasonably large prompts.
The reason is that transformers are highly parallel (which is one of the things that makes them easy to train), so you can throw a lot of compute at them, to the point that compute becomes the dominating factor during training. “Prompt processing” is thus highly parallel, and as long as you don’t use super large prompts, you won’t need a ton of compute. The same goes for batching in token generation - if you batch together multiple prompts and generate tokens, you can use more compute (reusing the same weights across the whole batch), increasing your throughput.
However, for local inference, where the most common usecase is to just use a batch size of 1-2, suddenly you are bottlenecked by the speed of the memory that needs to be moved around in the GPU. As a consequence, people using GPUs for local LLM inference try to find as much VRAM in GPUs as cheaply as possible, given a certain baseline amount of memory bandwidth and GPU support (right now, if you use non-Nvidia GPUs, you might end up with limited software support for a bunch of applications, but for local LLMs, there are choices where some non-Nvidia GPUs work - llama.cpp has Vulkan support that lets you do inference on a wide variety of GPUs, though I haven’t looked into this too much). This makes Nvidia RTX 3090s with 24GB of VRAM some of the highest value for money (and hence scarce) cards in the market for large local LLM inference purposes. (It is possible to run smaller LLMs with smaller GPUs too, so don’t worry if you can’t afford/don’t want to buy an expensive one).
Note that this is not the case for other types of architectures - for example, diffusion models (usually used in image/video generation) are compute-intensive, so inference takes less time on a 4090 than on a 3090 for most image generation models (the 4090 has much more compute but similar memory bandwidth).
Coming to why VRAM (and having a GPU in general) is so important: CPUs have much less compute than GPUs, and most normal RAM setups have much lower memory bandwidth (around an order of magnitude) than GPU VRAM. Even with something like 8-channel DDR5 RAM (which is faster than DDR4), for which you need Xeon/Epyc server setups, you will only just match GPUs in memory bandwidth. Compute also remains a bottleneck for prompt processing, so you’ll end up with low prompt processing speeds, which hurts conversations with long system prompts and tasks like document processing or RAG. If GPU VRAM is insufficient or too expensive, people use hybrid inference, offloading some of the layers to the CPU instead of keeping everything on the GPU - this is generally only worthwhile for a small fraction of the layers, because of how slow CPU inference is compared to the GPU.
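For a rough feel of why memory bandwidth dominates at batch size 1, note that each generated token requires streaming essentially all of the (active) weights through the processor once. A back-of-the-envelope sketch (the bandwidth numbers are ballpark figures, and real-world speeds are lower due to overheads):

```python
def max_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    """Crude upper bound for batch-1 generation: every token reads ~all weights once."""
    return bandwidth_gb_s / model_gb

# Approximate peak bandwidths: dual-channel DDR4 ~50 GB/s, 8-channel DDR5 ~300 GB/s,
# RTX 3090 VRAM ~936 GB/s. Model size: a ~16 GB quantized model.
for name, bw in [("dual-channel DDR4", 50), ("8-channel DDR5", 300), ("RTX 3090 VRAM", 936)]:
    print(f"{name}: <= {max_tokens_per_second(16, bw):.0f} tokens/s")
```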
So the natural question is, can we make these models even smaller, with minimal loss in quality? That’s what we’ll talk about below. Quantization makes models smaller, and as a concept it helps in both inference at scale as well as locally.
Quantization
As the name suggests, quantization means taking something (relatively) continuous and mapping it to a discrete set. In our case, we map a high-precision floating point value to a low-precision floating point value, or a small range integer value.
For example, let’s consider a model that was trained in bfloat16 precision for the sake of simplicity. A bfloat16 floating point number has 1 sign bit, 8 exponent bits, and 7 mantissa bits. The mantissa holds the digits after the “decimal” point in scientific notation: if we were dealing with base 10, and the mantissa was 12300000, the exponent was 00000002 and the sign bit was 0, then this would represent the number \((-1)^{0} \times (1 + 0.123) \times 10^{2 - \text{shift}}\), where \(\text{shift}\) is some offset that ensures both very large and very small positive numbers can be represented. In binary, we have the same thing but in base 2. bfloat16 is just a shortened FP32 that we get by compromising on the mantissa (and hence on precision, i.e., the gap between two representable floating point numbers).
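To make the layout concrete, here is a tiny Python sketch showing that bfloat16 is just the top 16 bits of an FP32 value (real conversions round to nearest rather than truncating, which is done here for simplicity):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the FP32 encoding: 1 sign + 8 exponent + 7 mantissa bits."""
    fp32_bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return fp32_bits >> 16

def bf16_bits_to_fp32(bits: int) -> float:
    """Pad the 16 bfloat16 bits back out with zeros to recover an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

x = 3.14159265
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))  # ~3.140625: same range, coarser precision
```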
It turns out that LLM parameters don’t really need the full range of the type they are trained in. Only intermediate computations require precision. In fact, recent research shows that you need, on average, around 11 bits per LLM parameter, instead of the full 16, for lossless compression. Quantization is much more aggressive though, with little to no loss of quality on everyday tasks down to as few as 4 bits of precision, as validated by scaling laws for precision. As your LLM is quantized to use fewer and fewer bits, quality degradation is expected (since the LLM is now being more approximate compared to how it was trained).
There are generally two ways of getting a quantized model out of a freshly trained model - quantization-aware training (QAT) or post-training quantization (PTQ). QAT requires some fine-tuning on a subset of the training data after the model has been trained, during which the quantization parameters are optimized as well. For example, one way to do it is to run the forward pass with quantized weights while keeping the gradients (and the master copy of the weights) in full precision. However, doing this at scale at the third-party level (people who quantize models are generally not those who trained them) is impractical in a world where LLMs are released every few days and people want to try them out immediately, so PTQ wins over QAT in availability. Notably, Google released QAT versions of the Gemma 3 model family which have quite decent performance.
Let’s try to get a feel for how this can be done. For a value \(x\) to be quantized, pick a scale \(s\) (the spacing between representable values) and a zero point \(z\) (usually \(0\) in the symmetric case), and replace \(x\) by \(q = \text{round}\left(\frac{x - z}{s}\right)\). If \(s\) is small relative to the spread of the values, some \(q\) will fall outside the allowed range of the target datatype, so we also need to clip. At inference, we approximately invert this by pretending the rounding/clipping never happened, i.e., we reconstruct \(\hat{x} = s \cdot q + z\).
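A minimal numpy sketch of this scale/zero-point scheme (with the scale chosen from the largest magnitude so that nothing gets clipped):

```python
import numpy as np

def quantize(x, s, z, qmin=-7, qmax=7):
    """q = clip(round((x - z) / s)) into the integer range of the target type."""
    return np.clip(np.round((x - z) / s), qmin, qmax).astype(np.int8)

def dequantize(q, s, z):
    """Approximate inverse, ignoring rounding/clipping: x_hat = s * q + z."""
    return s * q.astype(np.float32) + z

x = np.array([0.03, -0.41, 0.27, 0.95], dtype=np.float32)
s, z = np.abs(x).max() / 7, 0.0      # symmetric 4-bit-style setup
q = quantize(x, s, z)
print(q, dequantize(q, s, z))        # per-weight error is at most s/2 here
```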
This is generally how we get quants (quantized models/methods are usually called “quants”). Note that memory layout is quite important for sub-byte quantization since memory on most systems is byte-addressable, not bit-addressable (so in order to access a small 4-bit chunk of memory, you need to access the whole byte), but I will not discuss that (or other implementation details) for the sake of simplicity here.
Different quantization methods tend to have different things that they optimize for (like error in weights, errors in layer outputs and so on), alongside the methods they use for them. However, there are a few metrics that are helpful for determining the quality of quants: for example, perplexity increase on a certain reference dataset relative to FP16 or original weights (perplexity is a metric that tells how surprising/novel a piece of text is for a language model, so lower is better), the KL divergence of logits (sometimes activations) relative to that for FP16, some statistics (min/max/mean/median/quartiles) for deviation from FP16 weights/activations and so on.
We’ll go through different quantization methods one by one.
llama.cpp - *_0
llama.cpp quants with abbreviations Q<x>_0 have most parameters compressed to \(x\) bits (other than the embedding layer, which is stored in FP16 on the CPU). These are centered, i.e., in any given block/row of floats, we find the float with the maximum absolute value and divide it by \(2^{x-1} - 1\) to get the scale. This ensures that the quantized values lie between \(-(2^{x-1} - 1)\) and \(2^{x-1} - 1\) and hence fit in \(x\) bits. For each block we then need to store the scale (or its inverse), which adds an overhead of 2 bytes per block (or less if the scale itself is quantized). As we reduce the block size, accuracy gets better, but we also have to store more scales. There are some other deprecated quants like Q4_2 which use a smaller block size. These quants should generally not be used in favour of newer quants.
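Here is a rough sketch of the absmax idea in numpy (illustrative only; ggml’s exact formula and storage layout differ):

```python
import numpy as np

def quantize_q0_like(block, bits=4):
    """Sketch of the absmax idea behind llama.cpp's Q<x>_0 quants
    (illustrative, not ggml's actual code)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    scale = np.abs(block).max() / qmax              # one scale per block
    q = np.clip(np.round(block / scale), -qmax, qmax).astype(np.int8)
    return q, np.float16(scale)                     # scale stored as 2 bytes

block = np.random.randn(32).astype(np.float32)      # llama.cpp blocks hold 32 weights
q, scale = quantize_q0_like(block)
max_err = np.abs(block - q * np.float32(scale)).max()  # bounded by roughly scale/2
```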
llama.cpp - *_1
One issue with the above quants is that they can waste representable values: if the smallest and largest weights in a block are asymmetric, a min-max style normalization (use the range of the numbers instead of the largest magnitude, and shift the inputs) gives much tighter spacing between representable values. This gives us quants from \(0\) to \(2^x - 1\), but we need to store two quantization parameters - a min and a scale - instead of one. These parameters can be quantized as well if required. There are some other deprecated quants like Q4_3 which use a smaller block size; these should not be used in favour of newer quants.
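And the min-max counterpart, again as an illustrative sketch rather than the exact ggml layout:

```python
import numpy as np

def quantize_q1_like(block, bits=4):
    """Sketch of the min-max idea behind llama.cpp's Q<x>_1 quants:
    per block, store a scale and a minimum, and keep unsigned quants."""
    qmax = 2 ** bits - 1                            # 15 for 4 bits
    lo, hi = block.min(), block.max()
    scale = (hi - lo) / qmax
    q = np.clip(np.round((block - lo) / scale), 0, qmax).astype(np.uint8)
    return q, np.float16(scale), np.float16(lo)

block = np.random.randn(32).astype(np.float32)
q, scale, lo = quantize_q1_like(block)
reconstructed = q * np.float32(scale) + np.float32(lo)  # dequantize: q * scale + min
```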
llama.cpp - *_K
One issue with the above idea is that as the bpw (bits per weight) decreases, the overhead starts becoming much more prominent. So one solution is to have a hierarchical block structure, where for each group of blocks (of some specified size), we store a higher precision scale variable, and for each of the smaller blocks, we store a low precision scale variable (the final scale is the product of the low and the high precision scales). This effectively shaves off unnecessary bits of information from the scales.
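A sketch of this scales-of-scales idea (the block sizes and bit widths below are illustrative, not llama.cpp’s exact K-quant layout):

```python
import numpy as np

def two_level_scales(superblock, n_sub=8, sub_bits=6):
    """Store one full-precision scale per superblock plus a low-bit integer
    multiplier per sub-block; the effective sub-block scale is their product."""
    subs = superblock.reshape(n_sub, -1)
    ideal = np.abs(subs).max(axis=1) / 7             # ideal 4-bit scale per sub-block
    super_scale = ideal.max() / (2 ** sub_bits - 1)  # one FP scale for the whole superblock
    sub_q = np.round(ideal / super_scale).astype(np.uint8)  # 6-bit integer scales
    return np.float16(super_scale), sub_q            # effective scale i = super_scale * sub_q[i]

super_scale, sub_q = two_level_scales(np.random.randn(256).astype(np.float32))
```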
There are also quants that look like Q5_K_M or Q4_K_S or Q3_K_L. This is because K-quants generally apply different quantization levels to different parts of the network (the S/M/L suffix indicates the mix), since it turns out that some weights are more sensitive to quantization than others.
These quants are fine if you need 6 bit quantization and above. For more information on llama.cpp quants, I highly recommend going through ikawrakow’s work on the llama.cpp and ik_llama.cpp repositories.
llama.cpp - imatrix
When we go to smaller bpw ranges, choosing which weights to prioritize becomes more and more important, and imatrix quants do precisely this. The importance of each weight is calibrated on a dataset, and prioritization happens according to those importances; more precisely, the imatrix-weighted RMSE of the quantized weights with respect to the original weights is minimized. This is currently done heuristically: first find a rough value for the quants using some scale estimate, then use those quants to solve exactly for the optimal scale, and finally search over a range near that scale to find the best quants. The imatrix entry itself is an accumulation of the input activations to the weight, taken over the dataset.
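In spirit, the objective and the exact-scale step look something like the following simplified sketch (the real llama.cpp code also searches around this scale and works block by block):

```python
import numpy as np

def weighted_error(w, q, s, imp):
    """Importance-weighted squared error between original and dequantized weights."""
    return np.sum(imp * (w - s * q) ** 2)

def optimal_scale(w, q, imp):
    """For fixed integer quants q, the weighted-least-squares scale is closed-form:
    argmin_s sum(imp * (w - s*q)^2) = sum(imp*w*q) / sum(imp*q*q)."""
    return np.sum(imp * w * q) / np.sum(imp * q * q)

w = np.random.randn(32).astype(np.float32)
imp = np.random.rand(32).astype(np.float32)   # stands in for accumulated activation statistics
q = np.round(w / (np.abs(w).max() / 7))       # rough initial quants
s = optimal_scale(w, q, imp)
```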
Whether quants use imatrix or not is somewhat hard to tell, but the most popular quants uploaded by people/organizations like bartowski or unsloth on Hugging Face are imatrix quants. You can usually find both non-imatrix (static) and imatrix by mradermacher on Hugging Face.
Generally, imatrix variants of quants should be used by default. They are supported for both the k-quants as mentioned above, and the i-quants that will be mentioned below.
llama.cpp - IQ_*
imatrix and IQ_* quants were both introduced around the same time, with the goal of improving 2-bit quants, which sometimes leads to confusion about their purposes, but they’re separate things. It’s worth noting that these quants (IQ_* with imatrix) should generally be used below 4-bit quantization.
IQ_* picks up a couple of ideas from the QUIP# paper - using codebooks (lookup tables), and forcing the number of negative quants in each block of 8 to be even (flipping the sign of the least important quant, as judged by the imatrix, when needed). For the 2-bit variant, each block of 8 weights gets 16 bits. The idea is to first encode the signs of the quants (7 bits, thanks to the parity constraint), and then use the codebook to look up the closest magnitudes for each weight (9 bits remain, so a codebook of size 512 would be possible). The codebook consists of 8-tuples with entries from \(\{1/2, 3/2, 5/2\}\), and one bit selects whether adding or subtracting \(1/4\) gets us closer to the original weights. That leaves 8 bits, i.e., 256 possible indices, each corresponding to an 8-tuple. The QUIP# paper uses all points within a sphere plus some handpicked points, but the IQ_* codebook is constructed more empirically, by brute force over models: perform quantization using all possible lattice points, count how many times each point occurs, and then select points so that the total count of the selected points is maximized while the maximum distance from any unselected point to its closest selected point is minimized.
In later quantization schemes like IQ4_XS, which is one of the most-used quants, the parity-forcing is sometimes relaxed, and the process of constructing a codebook is much simpler - the codebook is implicit in the sense that instead of any rounding or explicit lookups, there is a static range-based mapping to the quants (i.e., you pick one element out of an array that looks like [-8, -4, -2, 0, 2, 4, 8], for each weight). This allows for more non-linear behaviour - for example, the inter-quant spacing near 0 is much smaller than the inter-quant spacing near the edges of the range.
One reason why this differential spacing is good is that obscure/tail knowledge is sometimes encoded in outliers in weights, so in a linear scaling + rounding, outliers drown out the signal from general weights (which end up being mapped to the same value despite the range having more information in it). The specific non-linear encoding here makes things more fine-grained for the non-extreme weights, so more knowledge can be preserved during quantization.
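A toy version of such an implicit, non-uniform codebook (the grid below mirrors the illustrative array above; the real IQ4_NL/IQ4_XS table has 16 carefully chosen values):

```python
import numpy as np

# Illustrative non-uniform grid: finer spacing near zero, coarser at the edges.
GRID = np.array([-8.0, -4.0, -2.0, 0.0, 2.0, 4.0, 8.0], dtype=np.float32)

def quantize_nonuniform(block):
    """Map each weight to the index of the nearest grid entry (after scaling the
    block so its largest magnitude lines up with the grid's largest magnitude)."""
    scale = np.abs(block).max() / np.abs(GRID).max()
    idx = np.abs(block[:, None] / scale - GRID[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), np.float16(scale)

block = np.random.randn(32).astype(np.float32)
idx, scale = quantize_nonuniform(block)
reconstructed = GRID[idx] * np.float32(scale)  # small weights keep fine detail, outliers still covered
```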
These quants are generally somewhat slower than the k-quants, but given that most people are bottlenecked by RAM/VRAM capacity rather than speed, it is generally a good idea to use them whenever possible over similarly sized k-quants.
llama.cpp - UD_*
These are Unsloth’s Dynamic v2.0 quants. The main difference from the previous methods is that each layer is quantized differently based on its sensitivity to quantization. People mention that these quants and IQ_* quants are similar in quality (with dynamic quantization sometimes winning out), but I haven’t experimented with them, so feel free to experiment.
Quantizing the KV cache
While this is not about quantizing the weights themselves, if we account for the total memory usage of a transformer, we see that it keeps growing with the context length (the number of tokens the LLM is attending to in a session) because of the Key-Value cache (KV cache for short). To save memory here, the cache can be kept in FP16 (the default in llama.cpp) or quantized with methods like *_0 or *_1, for instance. The k-cache is generally more sensitive to quantization (severe quality degradation below Q8), while the v-cache is not that bad (Q4_0 is still usable, but Q8_0 is still recommended). The idea is simple - we want the cache to be as small as possible in order to squeeze in more context.
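To see why this matters, here is a rough estimate of KV cache size (the configuration numbers are illustrative, roughly matching a Llama-3.1-8B-style model; the bytes-per-element figures for the quantized caches are approximate and include block-scale overhead):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elt):
    """K and V each store context_len x n_kv_heads x head_dim elements per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elt

# 32 layers, 8 KV heads, head_dim 128, 32k context:
for name, b in [("FP16", 2.0), ("Q8_0-ish", 1.06), ("Q4_0-ish", 0.56)]:
    print(f"{name}: ~{kv_cache_bytes(32, 8, 128, 32_768, b) / 2**30:.1f} GiB")
# FP16: ~4.0 GiB, Q8_0-ish: ~2.1 GiB, Q4_0-ish: ~1.1 GiB
```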
GPTQ
This is a precursor to most quantization methods we’ll talk about, and is generally not used widely. However it had a bunch of ideas which were adopted by other quantization schemes. It also supports 2, 3 and 4 bit quantizations apart from the standard 8 bit. It quantizes weights layer by layer by minimizing the L2 norm of the difference in layer outputs before and after quantization on a given dataset, and it does so using Hessian information (and deals with instability issues using a Cholesky decomposition). It dynamically dequantizes weights to FP16 during inference. It’s mainly meant to be used on GPUs, but has some limited support for CPU inference. I recommend reading the GPTQ paper for a more detailed description.
AWQ
AWQ was built upon the idea that some weights are better left unquantized, and these are on the order of 0.1% of all weights in the model. The authors initially tried things like leaving outlier weights unquantized, but then realized that the weights worth protecting are the ones whose input activations are largest and most significant. Hardware-related practicalities prevent a naive implementation from working well, so a couple of ideas help: first, rather than marking single weights, salient weights are “marked” channel-wise, i.e., the whole channel is treated as salient. Second, keeping these salient weights in mixed precision is problematic because their locations are fairly random, which causes implementation issues; to fix this, all weights are quantized, but each input channel is pre-scaled according to its input activation scale, with the scales found by solving an optimization problem over layer outputs similar in spirit to GPTQ’s. This does require a bit more VRAM than GPTQ, for instance. It’s also somewhat tightly integrated with GPU inference as of now, so remember to use this only if you can fit everything on the GPU. I recommend reading the AWQ paper for a more detailed description.
bitsandbytes (BNB)
BNB is a format that allows for more than just inference - there is integration for INT8 or NF4 (4-bit NormalFloat) inference and QLoRA-based finetuning, which keeps computations in 16/32 bits but weights and activations in lower precision. This allows for inference as well as finetuning in 4 bits with quite competitive results. It has a lot of functionality integrated with Hugging Face transformers, so it’s commonly used for experimentation and finetuning. For more details, you should probably read the LLM.int8(), 8-bit optimizers, and QLoRA papers from the people who developed BNB.
EXL2
EXL2 is a quantization format supported by ExLlamaV2 that is based on GPTQ and supports 2, 3, 4, 5, 6 and 8-bit quants. Quite importantly, it supports mixing quantization levels within linear layers (and across layers in the model), which allows more important weight channels to be quantized with more bits (unlike in AWQ).
From the GitHub README: Parameter selection is done automatically by quantizing each matrix multiple times, measuring the quantization error (with respect to the chosen calibration data) for each of a number of possible settings, per layer. Finally, a combination is chosen that minimizes the maximum quantization error over the entire model while meeting a target average bitrate.
If you’re using a GPU that can fit your whole LLM in its VRAM, and there is no EXL3 support for your LLM of choice so far, it is highly recommended to use EXL2.
EXL3
This is an update to EXL2 with SOTA quantization methods. It is still being developed at the moment (and if you’re interested, you should definitely help them out), so there are a bunch of features that will be added over time, like multimodal support, full tensor-parallelism, speculative decoding and better performance on Ampere GPUs (3090s for example). The main changes on top of EXL2 are some ideas taken from the QTIP paper and adapted for EXL2, as well as some QoL improvements for multi-GPU setups.
To understand QTIP, we’ll need to review a bit of QUIP#. QUIP# uses a randomized Hadamard transform to make the matrices in question approximately Gaussian (instead of a randomized orthogonal matrix, which is more expensive and doesn’t have as nice theoretical properties), as well as a multidimensional codebook instead of the scalar codebook that normal quantization methods effectively use when handling weights one by one. The benefit of a multidimensional codebook is that you can “sample” the lattice points from the distribution you expect the weights to come from. By increasing the dimension and choosing more lattice points, you can impose stronger priors on what the distribution of weights should look like, and if your priors are right, you need to store fewer bits for the same quality, since the priors do the remaining heavy lifting. Think of it in terms of how much information you store in the weights versus how much you already know about them from statistical analysis of the original weights, which is what you encode in your priors. Since we assume the weights are Gaussian after the “incoherence” transform, it makes sense to use as large a dimension as possible without breaking the Gaussian assumption. The issue with QUIP# is that the number of lattice points required grows exponentially with the dimension.
So instead of vector quantization, QTIP tries to find a better mechanism to encode the weights. In information theory terms, we have a Gaussian information source (the one generating the weights, independently), and we want to encode that information with a simple model. To scale well to longer sequences, the codebook of a good compression scheme needs to be as small as possible while remaining expressive enough to represent long sequences sampled from a given space. Do we know of ways to generate sequences of numbers whose counts explode combinatorially? Markov chains are one such mechanism, though somewhat ironically, they imply a dependence between adjacent elements. However, if we restrict ourselves to Markov chains with a large enough state space, it is easy to come up with chains, even ones with a fairly regular transition structure, that can produce any given sequence. TCQ (trellis coded quantization) does precisely this - it takes a Markov chain with a pre-determined structure and labels its nodes with high-precision floating point numbers. The implementation details are what make it feasible to run on hardware - a “bitshift trellis” (equivalent to a Markov chain with an edge from i to j if the lowest k bits of i and the highest k bits of j match) allows parallel decoding, and the whole codebook never needs to be stored, thanks to a pseudorandom number generator (which can be implemented on a GPU fairly easily) that takes as input an index and a small lookup table.
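As a toy illustration of the bitshift trellis idea (a hypothetical sketch, not QTIP’s actual kernel or codebook; in particular, the integer hash below stands in for the PRNG-plus-lookup-table, which in the real scheme produces approximately Gaussian values):

```python
import numpy as np

def decode_bitshift_trellis(bits, window=16, step=2):
    """Weight i is decoded from the `window` bits starting at position i*step, so
    consecutive states share (window - step) bits and every weight can be decoded
    independently (and hence in parallel)."""
    n_weights = (len(bits) - window) // step + 1
    out = np.empty(n_weights, dtype=np.float32)
    for i in range(n_weights):
        state = int("".join(map(str, bits[i * step : i * step + window])), 2)
        h = (state * 2654435761) & 0xFFFF     # toy integer hash of the state
        out[i] = (h / 0xFFFF - 0.5) * 2.0     # map to roughly [-1, 1]
    return out

bits = np.random.randint(0, 2, size=16 + 2 * 255)  # 256 weights at ~2 bits per weight
print(decode_bitshift_trellis(bits)[:5])
```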
Fun fact: TCQ is inspired by TCM (trellis coded modulation), which was an important development in telecom because it helped in transmitting more information over bandwidth-limited channels like telephone lines, which also requires compressing information as much as possible.
In any case, this is highly recommended if the models you can find quants for are already supported by ExLlamaV3, and the quants (with the KV cache) fit in your VRAM.
Further reading
I recommend reading the QUIP, QUIP#, QTIP, AQLM papers for some SOTA quantization methods, as well as papers recommended in each of the above quantization methods. If you like reading discussions, I highly recommend reading ikawrakow’s discussions and comments on the llama.cpp and ik_llama.cpp repositories.
Some general notes
There are quite a lot of other quantization methods, but I’ve only included ones that are in general use (enough for the most popular LLMs to have pre-made quants for them in those formats) in favour of a shorter blog post. You can look at Hugging Face documentation and vLLM documentation for more information on other types of quantization. This would be especially useful if you want to do inference on more modern GPUs (for instance, Nvidia’s TensorRT Model Optimizer allows you to quantize models up to FP4 for inference on Blackwell GPUs).
In general, you will also find the r/LocalLLaMA subreddit to be a great source of practical information, which is also where some of the information in this post comes from.
Note that most LLM providers don’t serve models with sub-8-bit quantization. For running things locally, it’s generally best to pick the largest (highest-bit) quant that fits in your VRAM along with the KV cache (or mostly fits in VRAM, for llama.cpp quants).
Bigger models and undertrained models can generally survive heavier quantization than smaller models, so do keep that in mind while looking for models. For example, from Llama 3.1 and Qwen 2.5 onwards, near-lossless quants sat at around 6 bits, and 4-bit quants deviated from the 16-bit versions noticeably more than they did for older model generations; meanwhile, people report larger LLMs (on the order of 32B or 70B and up) to be usable with 4-bit or lower quantization schemes. How much of that is selection bias is debatable, but if you have low VRAM you generally don’t have a choice, and a larger LLM at a lower quantization is generally better than a ~lossless quant of an LLM half its size.
If you’re running an LLM locally, you would typically choose llama.cpp, EXL2 or EXL3 quants. People also use AWQ due to its integration into vLLM, but that takes up more VRAM. If your required quants fit completely into your GPU’s VRAM, use EXL3 for higher quality at low bitrates, and EXL2 otherwise (note that ExLlamaV2 currently has more features than ExLlamaV3, which can also be a deciding factor). If you want your quants to run on CPU-only/CPU+GPU/GPU-only systems, go for llama.cpp quants. llama.cpp has a fork called ik_llama.cpp which has much better performance in hybrid/CPU-only cases but fewer features (it has far fewer contributors than llama.cpp), and it is where a lot of the new open quantization research is happening. So if you don’t mind the missing features (which is probably the case if you just want to chat with LLMs), forks like these should also be considered in relevant setups. Also, consider turning on flash attention (for GPU usecases) wherever possible - it reduces VRAM usage significantly and also leads to faster generation in most cases. It can sometimes be quite noticeably buggy, though, so do check before relying on it.
If you have extra space lying around on your GPU, and the (large) LLM you want to run has a smaller counterpart (as usually happens within model families), you can sometimes use speculative decoding (supported in multiple backends), where the smaller model acts as a draft LLM suggesting completions; if the bigger model accepts them at a sufficiently high rate, your token generation speeds can improve significantly.
If you have an Apple machine or another machine with unified memory, that memory is much faster than typical desktop RAM, so it provides much better token generation speeds than traditional RAM setups (though still slower than GPUs in both token generation and prompt processing as of now). I haven’t looked into most systems with unified memory (like the “AI PCs” being marketed as LLM inference machines), but Apple machines have a good ecosystem called MLX, and models can be found here. I would expect llama.cpp to work on most platforms, though; if your hardware isn’t supported by llama.cpp yet, you’re probably out of luck (unless there are obscure packages for doing LLM inference on your specific hardware, which is unlikely).
You can also consider running MoE (mixture of experts) models (like Qwen 3 30B-A3B, Qwen 3 235B-A22B, DeepSeek V2/V2.5/V3/R1, Mixtral 8x7B, Mixtral 8x22B, WizardLM 8x22B) if you have enough VRAM to fit all the parameters (sometimes it is enough to fit most of the active parameters). The advantage is that since single-batch inference is memory-bound and not compute-bound, the fact that only the active parameters (3B in Qwen 3 30B-A3B, or ~13B in Mixtral 8x7B, and so on) are needed for each token brings token generation speed close to that of a dense model (like Llama 3.2 3B) with as many parameters as the MoE has active parameters. If you don’t have enough VRAM to fit even a significant number of parameters, you can consider offloading certain tensors to the GPU in llama.cpp/ik_llama.cpp (for example, I use --override-tensor "\.1*[0-8].ffn=CUDA0,[2-3][0-8]=CUDA1,ffn=CPU" for Qwen3 235B-A22B IQ3_M on llama.cpp), or use ktransformers (which can give some pretty good speedups, but with smaller coverage of models). As a rule of thumb, MoE models generally have roughly similar world knowledge to dense models with the same total parameter count, but they do fewer computations and are hence generally worse at reasoning (this is only a vague guideline). Another rule of thumb had been that their performance is generally similar to a dense model with roughly \(\sqrt{TA}\) parameters, where \(T\) is the total parameter count and \(A\) is the number of active parameters, though there has been some doubt cast on this assumption lately with the release of Qwen 3.
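As a worked example of that last rule of thumb (treat it as a very rough heuristic only): for Qwen 3 235B-A22B, \(\sqrt{235 \times 22} \approx 72\), so the estimate would put it in the same rough quality ballpark as a ~72B dense model, while it generates tokens at speeds closer to those of a 22B dense model.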
As far as doing surgery on a model is concerned, quantization is just one method of changing a model in a potentially low cost manner. There is stuff like pruning (with the Nemotron series from Nvidia being pruned from Llama models and even Llama 3.2 3B being pruned from Llama 3.1 8B - Nvidia has support for this in TensorRT Model Optimizer), merging model weights (merging any of the multiple possible models with the same architecture - the base model, instruction tunes and custom fine-tunes and so on - done using mergekit these days), merging/pruning experts in MoE models, activating more parameters, finetuning using LoRAs/QLoRAs etc. This is quite a big area of research and doesn’t require as much capital as fully training a model.
As usual, please let me know if you have any comments on the post!