Ollama is great for running a local self-hosted AI REST API. You can load all kinds of models, whether they’re for chatting, vision, or embedding. At this time, I’m not aware that voice models can be used for input or output, but that might only be a matter of time.
Why Ollama
Why would you want to run Ollama? First of all, the models you can load are likely less powerful than what you can consume from the OpenAI or Anthropic APIs. However, all processing stays fully local: you gain the privacy of data that never has to leave your house, let alone be processed in a US datacenter. Additionally, if you need to process massive amounts of data and the precision of the open models is sufficient, you might have to wait longer for everything to finish, but doing the same job against public APIs can rack up vast amounts of cost. Even by accident, you might burn through $80 in no time if you missed an error check in your code. With Ollama, you only have the upfront cost of the device and the energy it burns, and no sudden surprises. On the other hand, your processing capacity is of course limited; you can't massively parallelize. So it highly depends on the workload, but there is a sweet spot for solving issues on your own machine.
JSON structured output

In December 2024, Ollama also followed up with support for structured output, which is a great addition. It means the model can be asked to produce answers that conform to a clearly defined JSON schema.
Example request
curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Tell me about Canada."}],
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "name": {
        "type": "string"
      },
      "capital": {
        "type": "string"
      },
      "languages": {
        "type": "array",
        "items": {
          "type": "string"
        }
      }
    },
    "required": [
      "name",
      "capital",
      "languages"
    ]
  }
}'
Example output
{
  "capital": "Ottawa",
  "languages": [
    "English",
    "French"
  ],
  "name": "Canada"
}
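Note that the JSON shown above is what the model places into the message.content field of the chat response, and that field arrives as a string, so your client still has to parse it. A minimal sketch of doing that on the command line, assuming jq is installed:
# Send the request, extract the JSON string from message.content and pretty-print it
curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Tell me about Canada."}],
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "capital": {"type": "string"},
      "languages": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["name", "capital", "languages"]
  }
}' | jq -r '.message.content' | jq .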
This is comparable to the structured output feature OpenAI introduced for GPT-4o and GPT-4o mini in August 2024.
The Ollama API is now considerably more advanced than offerings like LM Studio's. So if you don't care about the UI, Ollama is a much better choice than Jan or LM Studio.
Personally, I use AI models pretty much every hour of my regular day. They are incredibly helpful. I mostly use them for coding and solving issues. I'm a huge proponent of automation and scripting, and the use cases for automation are so powerful that I can't overstate how much they have changed my day.
Running Ollama on old hardware
In the beginning, I wanted to run Ollama on an old HP MicroServer, but this was not possible at all: the CPU of my 10+ year-old machine could not come close to even the most basic minimum requirements. So running Ollama on old, outdated, underpowered machines, or on a Raspberry Pi, is probably not even worth a try.

https://www.seanmcp.com/articles/running-ollama-without-a-gpu
If you have older hardware that was maybe a gaming PC, a dedicated GPU can help. I started out with an older machine that has an NVIDIA GeForce GTX 1060 with 6GB of video RAM. This machine is powerful enough to run some basic models with Ollama installed via the native Windows client.
Models I can run on this PC
- falcon3:10b, Q4
- mistral:7b, Q4
- llama3.2-vision, Q4
- llama3.1, Q4
- llama3.2, Q4
- qwen2.5, Q4
- bge-m3, Q4
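To see whether a loaded model actually fits into the 6GB of VRAM or partially spills over to the CPU, you can check the PROCESSOR column of ollama ps. A quick check, assuming mistral:7b is already pulled:
# Load the model with a one-off prompt, then inspect where it is running
ollama run mistral:7b "Say hi"
ollama ps
# The PROCESSOR column shows e.g. "100% GPU" or a CPU/GPU split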
Running Ollama on M-Series Mac
What can be really interesting is running AI on a modern Mac system like a Mac Mini M4 Pro. The Pro can be interesting for its higher memory bandwidth. The M-series processors offer unified memory shared between CPU and GPU. To run large models, it's very important to have a vast amount of VRAM (video RAM) available to load them, and because the M-series Macs share their RAM between CPU and GPU, they are especially well suited to running larger models where a regular GPU with that much dedicated memory would become very pricey. However, you will likely still be limited to medium-sized models; the very large ones require a lot of memory to load and a lot of compute to run. Not least, you also need enough compute to reach a reasonable output speed (usually measured in tokens per second).
For example, on an Apple Mac mini (2024, M4 Pro, 48GB RAM, 1TB SSD, Z1JV) you can run the following models. This list makes no claim of completeness; it is more of an example.
- gemma2:27b-instruct-q8_0
- llama3.2-vision:11b-instruct-q8_0
- llama3.1:latest
- codegemma:7b-instruct-q8_0
- qwen2.5:32b-instruct-q8_0
- qwen2.5-coder:32b-instruct-q8_0
- mistral:7b-instruct-v0.3-q8_0
- bge-m3
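bge-m3 in the lists above is an embedding model, so it is not used for chatting but through the embeddings endpoint. A minimal sketch, assuming the model has already been pulled:
# Request an embedding vector for a piece of text
curl http://localhost:11434/api/embeddings -d '{
  "model": "bge-m3",
  "prompt": "Ollama runs models locally."
}'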
Ollama model library
The Ollama library usually has a preselected default tag for each model. If you are looking for Gemma 2, you see the option to download and run it like this:
ollama run gemma2:27b
Doing this is often not the best option. If you click "View all" on the model page, you can see why.

If we look up the Gemma 2 27B model with the hash 53261bc9c1920, we can see that the 'standard' pre-selected model is the instruct variant. That is good, as it is meant for chat-style interactions. But it comes with a quantization of 4 (Q4). Quantization is a form of compression, ranging roughly from Q2 (most compressed) to Q8 (least compressed). So for better results, Q8 is preferable to Q4, which is why I would recommend switching to the Q8 model type if you have hardware that can run it.

These are examples of how you would tell Ollama to pull a Q8 version of a particular model.
ollama pull gemma2:27b-instruct-q8_0
ollama pull mistral:7b-instruct-v0.3-q8_0
ollama pull qwen2.5:32b-instruct-q8_0
ollama pull qwen2.5-coder:32b-instruct-q8_0
ollama pull llama3.2-vision:11b-instruct-q8_0
I found a topic on Reddit (https://www.reddit.com/r/ollama/comments/1f9mx9n/why_does_ollama_default_to_q_4_when_q4_k_m_and_q4/, retrieved 2025-01-15) where, however, Q8 is not recommended. But as I try to get the absolute maximum out of a model, I still run Q8 at times, though I will also consider Q6_K.
Old quant types (some base model types require these):
- Q4_0: small, very high quality loss – legacy, prefer using Q3_K_M
- Q4_1: small, substantial quality loss – legacy, prefer using Q3_K_L
- Q5_0: medium, balanced quality – legacy, prefer using Q4_K_M
- Q5_1: medium, low quality loss – legacy, prefer using Q5_K_M
New quant types (recommended):
- Q2_K: smallest, extreme quality loss – not recommended
- Q3_K: alias for Q3_K_M
- Q3_K_S: very small, very high quality loss
- Q3_K_M: very small, very high quality loss
- Q3_K_L: small, substantial quality loss
- Q4_K: alias for Q4_K_M
- Q4_K_S: small, significant quality loss
- Q4_K_M: medium, balanced quality – recommended
- Q5_K: alias for Q5_K_M
- Q5_K_S: large, low quality loss – recommended
- Q5_K_M: large, very low quality loss – recommended
- Q6_K: very large, extremely low quality loss
- Q8_0: very large, extremely low quality loss – not recommended
- F16: extremely large, virtually no quality loss – not recommended
- F32: absolutely huge, lossless – not recommended
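If you are unsure which quantization a locally installed model actually uses, ollama show prints it along with other model details. A quick check, assuming llama3.1 is already pulled:
# Shows architecture, parameter count, context length and quantization
ollama show llama3.1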
Ollama remote server
If you want to connect remotely to your Ollama machine, note that by default Ollama binds only to localhost; you need to set OLLAMA_HOST to 0.0.0.0 to allow incoming connections on all interfaces.
This is not needed if you want to connect to Ollama on the same machine.
Windows
You also need to allow incoming connections on TCP port 11434 in the Windows Firewall.
# Set the variables
$ollamaHost = "0.0.0.0"            # listen on all interfaces
$ollamaModels = "D:\OllamaModels"  # optional: directory where Ollama stores its models (example path, adjust as needed)
# Set OLLAMA_HOST
[System.Environment]::SetEnvironmentVariable('OLLAMA_HOST', $ollamaHost, [System.EnvironmentVariableTarget]::User)
# Set OLLAMA_MODELS
[System.Environment]::SetEnvironmentVariable('OLLAMA_MODELS', $ollamaModels, [System.EnvironmentVariableTarget]::User)
# Confirm the changes
$currentHost = [System.Environment]::GetEnvironmentVariable('OLLAMA_HOST', [System.EnvironmentVariableTarget]::User)
$currentModels = [System.Environment]::GetEnvironmentVariable('OLLAMA_MODELS', [System.EnvironmentVariableTarget]::User)
Write-Host "OLLAMA_HOST has been set to: $currentHost"
Write-Host "OLLAMA_MODELS has been set to: $currentModels"
Write-Host "Please note that you may need to restart your PowerShell session or applications for the changes to take effect."
If you have the server running but a request like the following fails with a "model not found" error:
$body = @{
model = "llama3"
prompt = "Tell me a fact about Llama?"
stream = $false
} | ConvertTo-Json
$timeout = New-TimeSpan -Minutes 15
Invoke-RestMethod -Uri "http://100.113.244.108:11434/api/generate" -Method Post -Body $body -ContentType "application/json" -TimeoutSec $timeout.TotalSeconds
Invoke-RestMethod : {"error":"model 'llama3' not found, try pulling it first"}
At line:9 char:1
+ Invoke-RestMethod -Uri "http://100.113.244.108:11434/api/generate" -M ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod], WebException
+ FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand
you might want to check with ollama list whether the model is actually available on the server.
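If it is missing, pulling it on the server resolves the error. For example, for the llama3 model the request above asks for:
# List the locally available models, then pull the missing one
ollama list
ollama pull llama3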
macOS (Sonoma & Sequoia)
You need to run this command in the terminal and then restart Ollama.
launchctl setenv OLLAMA_HOST 0.0.0.0:11434
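To verify from another machine that the server is reachable, you can query the tags endpoint, which lists the models available on the server. A quick check, with a placeholder IP you would replace with your Ollama machine's address:
# Replace 192.168.1.50 with the IP address of your Ollama machine
curl http://192.168.1.50:11434/api/tags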
Download Ollama
If you are hooked, you can run Ollama either natively installed on your system or within Docker. I went with a native install to reduce some overhead, especially as I have very limited resources available on my computer, and it seems I get slightly more out of the machine with the native install than with the Docker one.
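If you prefer the Docker route instead, the commonly documented CPU-only setup looks roughly like this (a sketch; for GPU passthrough, check the official Ollama Docker instructions):
# Start the Ollama server in a container, persisting models in a named volume
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# Pull and run a model inside the container
docker exec -it ollama ollama run llama3.2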