What I learned about Ollama and Models

While I’m aware those models aren’t limited or bound to Ollama, Ollama is still the way I interface with and use them. Here I try to keep notes on how I use certain models and what I like or dislike about them.

This is a purely subjective view. There are many attempts at objectively measuring the performance of models; this is not one of them.

| Model | VRAM Size | Quantization | Remarks from my personal experience |
|---|---|---|---|
| gemma3:27b | 20627.03 MB | Q4_K_M | 📕 When using Gemma 3 to correct my English, I see that the text it creates, while grammatically correct, no longer means the same thing. Therefore, I prefer Gemma2 for correcting English as it does it well and preserves the content and writing style more. 👁‍🗨 Supports vision |
| gemma2:27b | 19903.46 MB | Q4_0 | 📕 Very good in fixing English or German texts. ⭐ This is my most used and favorite model at the moment. |
| mannix/gemma2-9b-simpo:latest | 9080.95 MB | Q4_0 | 📕 Very good at translating into English and fixing English. |
| qwen3:32b | 24773.64 MB | Q4_K_M | |
| qwen3:30b-a3b | 20293.31 MB | Q4_K_M | For fast answers |
| deepseek-r1:32b | 22385.44 MB | Q4_K_M | |
| mistral:7b | 6095.05 MB | Q4_0 | 📘 An efficient and cheap model that stays truer than others when summarizing texts when processing time is an issue. I summarized over 300’000 text samples in around 400 hours. |
| falcon3:10b | 8176.71 MB | 8176.71 MB | |
| llama3.1:latest | 6942.47 MB | Q4_K_M | |
| llama3.2:latest | 4090.59 MB | Q4_K_M | |
| llama3.2-vision:latest | 11696.27 MB | Q4_K_M | 👁‍🗨 Supports vision |
| devstral:latest | 16035.07 MB | Q4_K_M | Devstral is built under a collaboration between Mistral AI and All Hands AI 🙌 Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM, making it an ideal choice for local deployment and on-device use. |
| Osmosis/Osmosis-Structure-0.6B:latest | 2,905.68 | unknown | A specialized small language model (SLM) designed to excel at structured output generation |
| bge-m3 | 1690.55 MB | F16 | 🎯 This is a multi-language embedding model, the model of choice for my RAG. I usually combine it with Qdrant. ⭐ This is my most used and favorite embedding model at the moment. |
| x/z-image-turbo:latest | 21,504.00 MB | N/A | Z-Image is a powerful and highly efficient image generation model. |
| translategemma:27b, translategemma:12b, translategemma:4b | 28,968.22 MB | Q4_K_M | A new collection of open translation models built on Gemma 3, helping people communicate across 55 languages. |
| r1-1776:70b | | | A version of the DeepSeek-R1 model that has been post trained to provide unbiased, accurate, and factual information by Perplexity. |
| functiongemma:270m | | | FunctionGemma is a specialized version of Google’s Gemma 3 270M model fine-tuned explicitly for function calling. |
| benhaotang/Nanonets-OCR-s:latest | | | OCR model by Nanonets that excels at turning anything into markdown |
| gpt-oss-safeguard:20b | | | Trained and tuned for safety reasoning to accommodate use cases like LLM input-output filtering, online content labeling and offline labeling for Trust and Safety use cases. |
| phi4:14b | | | Phi-4 is a 14B parameter, state-of-the-art open model from Microsoft. |
| granite3.2-vision | | | A compact and efficient vision-language model, specifically designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and more. |
| qwen3-vl | | | |
| qwen3-coder-next:q4_K_M | | | Sadly too big for my AI machine. It works, but with too much swapping. Deleted on April 20th, 2026. |

Reflection on what I run on Ollama

In this section, I will reflect on my actual usage. Read more about these graphs.

Week 18 (2025)

I’m a user of all kinds of AI, including OpenAI ChatGPT and Anthropic Claude, and potentially soon also the Grok API via Azure. Some data, however, I process locally because I don’t want to send it to the APIs of those companies, and also for my personal learning journey. This week shows heavy usage of Mistral:7B. I like to use Mistral:7B for summarization when power consumption is an issue. As I was processing a hundred thousand entries, I chose Mistral:7B over Gemma2:27B to keep accuracy reasonably high while still finishing in acceptable time on my Mac.
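A local summarization call of this kind can be sketched roughly as follows. This is a minimal sketch, not the actual workflow: it assumes Ollama is listening on its default local address, uses the non-streaming `/api/generate` endpoint, and the prompt wording and the helper names `build_summary_request` and `summarize` are made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_summary_request(text: str, model: str = "mistral:7b") -> dict:
    """Build the JSON payload for a non-streaming summarization call."""
    # The prompt wording is illustrative, not the one used in the original workflow.
    prompt = f"Summarize the following text in three sentences:\n\n{text}"
    return {"model": model, "prompt": prompt, "stream": False}


def summarize(text: str, model: str = "mistral:7b", timeout: float = 120.0) -> str:
    """Send the text to a local Ollama server and return the generated summary."""
    payload = json.dumps(build_summary_request(text, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With "stream": False the server returns one JSON object
        # whose "response" field holds the generated text.
        return json.loads(response.read())["response"]
```

Swapping the `model` argument is all it takes to compare a small model against a larger one on the same input, which is how a trade-off like Mistral:7B versus Gemma2:27B can be tested.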

During week 18 (2025) (Apr 28 – May 4), Ollama processed 11 model sessions with a total runtime of 7 hr, 4 min across 8 different models.

Week 19 (2025)

I’m a huge fan of gemma2:27b and was excited about the release of gemma3:27b. However, after testing and comparing both models, I prefer gemma2:27b because in my summarization and translation workflows it produced results closer to what I expected. Gemma3:27b made a lot more “creative” changes and even changed the meaning in unintended ways.

During week 19 (2025) (May 5 – May 11), Ollama processed 58 model sessions with a total runtime of 4 hr, 37 min across 8 different models.

Week 20 (2025)

I more or less abandoned my summarization and translation workflow attempts with gemma3:27b and switched back to gemma2:27b, which is a kickass model. I also updated the Qdrant vector database with new content, where my go-to model is bge-m3:latest.

I was also exploring the new model qwen3:30b-a3b, but couldn’t find a use case for it. It’s always good to see and try, though.

During week 20 (2025) (May 12 – May 18), Ollama processed 63 model sessions with a total runtime of 1 day, 9 hr across 5 different models.

Week 21 (2025)

This was more or less a regular week with large summarization jobs running on my still favorite model, Gemma 2:27B.

During week 21 (2025) (May 19 – May 25), Ollama processed 34 model sessions with a total runtime of 1 day across 10 different models.

Week 22 (2025)

We see two new models on the list. I was playing with devstral:latest, a derivative of Mistral tailored for coding. I used it in conjunction with AllHands, which gives you a chat interface to build your applications. This was my best outcome to date with code generation on local, limited processing power.

In general, however, we see that my regular summarization and translation workflow is still processing the bulk of the time using Gemma2.

During week 22 (2025) (May 26 – Jun 1), Ollama processed 31 model sessions with a total runtime of 16 hr, 28 min across 4 different models.

Week 23 (2025)

This was not a good week for my Mac. After a power outage caused by a storm, I lost access to it for about five days. Therefore, fewer workloads were processed this week, and nothing new could be tested. I was forced to use low-power Gemma 2 models and found that they are not up to the task. Anything below Gemma:12b is pretty much unusable for translation and summarization in my case. I’m happy my Mac is now working again.

During week 23 (2025) (Jun 2 – Jun 8), Ollama processed 7 model sessions with a total runtime of 1 day, 15 hr across 2 different models.

Week 24 (2025)

Just a very regular week, although I also compared some Gemma models again. They just can’t convince me.

During week 24 (2025) (Jun 9 – Jun 15), Ollama processed 39 model sessions with a total runtime of 4 hr, 59 min across 3 different models.

Week 25 (2025)

I did some experimentation in understanding tasks from tickets using reasoning models. My favorite for local hosting remains deepseek-r1:32b.

During week 25 (2025) (Jun 16 – Jun 22), Ollama processed 88 model sessions with a total runtime of 7 hr, 6 min across 11 different models.

Week 26 (2025)

Just a boring week.

During week 26 (2025) (Jun 23 – Jun 29), Ollama processed 52 model sessions with a total runtime of 15 hr, 58 min across 2 different models.

Week 27 (2025)

I had neither the time nor the need to use an LLM this week.

During week 27 (2025) (Jun 30 – Jul 6), Ollama processed 2 model sessions with a total runtime of 10 min, 32 sec across 2 different models.

Week 28 (2025)

I had neither the time nor the need to use an LLM this week.

During week 28 (2025) (Jul 7 – Jul 13), Ollama processed 1 model session with a total runtime of 5 min, 20 sec across 1 model.

Week 29 (2025)

After the two quiet weeks, I reran the summarization and translation workflows to catch up on missed updates.

During week 29 (2025) (Jul 14 – Jul 20), Ollama processed 12 model sessions with a total runtime of 2 days, 1 hr across 1 model.

Week 30 (2025)

It was a rather quiet week. I accumulated a large backlog of translations and summaries and did not have enough time to reduce it. I also read about the newly released model Mistral-Small 3.2:24B but have not found a use case for it yet.

During week 30 (2025) (Jul 21 – Jul 27), Ollama processed 43 model sessions with a total runtime of 1 day, 12 hr across 2 different models.

Week 31 (2025)

I went on the hunt for a more efficient model for my use case of summarization and translation. I used LM Arena to determine users’ favorites with a lower parameter size. My current favorite model is gemma2:27b; now I am switching to gemma2-9b-simpo in the hope of getting similar results with less power.

https://lmarena.ai/leaderboard/text (2025-07-28)

This is the model I am trying as a replacement in my high-volume workflows, where I previously used Mistral:7B and then tried Gemma 2:27B, which just took too long. Now I hope Gemma 2-9B Simpo will solve these issues for me.

One benefit I have already noticed: when I send occasional competing work packages to the Ollama API, they are less disrupted, because the system finds a slot faster on Ollama between my bulk translation and summarization jobs. Therefore, my other low-volume processes are less likely to time out.

During week 31 (2025) (Jul 28 – Aug 3), Ollama processed 42 model sessions with a total runtime of 3 days, 5 hr across 2 different models.

Week 32 (2025)

No experiments this week. Only the regular summarization and translation workflows continue, now using the new model.

During week 32 (2025) (Aug 4 – Aug 10), Ollama processed 30 model sessions with a total runtime of 18 hr, 27 min across 4 different models.

Week 33 (2025)

OpenAI dropped an open model “gpt-oss.” I went out to get it in the 20B parameter version. They describe it as “OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases,” and I will try it out.

Week 50 (2025)

During week 50 (2025) (Dec 8 – Dec 14), Ollama processed 31 model sessions with a total runtime of 3 hr, 12 min across 4 different models.

I started working with MAKER, a very interesting approach that I began using to analyze unstructured maintenance-window notices arriving as free text.

MAKER achieves reliability through decomposition, not scale. Rather than depending on larger models, it orchestrates networks of smaller ones using three interlocking principles:

Maximal Agentic Decomposition (MAD): Tasks fragment into their smallest meaningful units, often one decision per agent. Each receives only essential context for its step. This isolation prevents error cascade, contains context drift, and makes corrections surgically precise.

First-to-ahead-by-k voting: Multiple agents tackle identical steps simultaneously. The system locks in whichever action first pulls k votes ahead of alternatives, creating fast local consensus. Modest per-step accuracy gains compound exponentially across thousands of steps, converting local agreement into system-wide dependability.

Red-flagging: Certain outputs betray confusion through telltale patterns such as excessive verbosity or malformed structure. MAKER preemptively filters these before they enter the vote pool, then resamples. This cuts both raw errors and correlated failures that might otherwise ripple through the chain.

Together, these mechanisms unlock predictable scaling laws. Vote requirements grow only logarithmically with step count; costs scale roughly linearly. Traditional approaches where agents handle multi-step chunks hit exponential cost walls. The insight: atomic decomposition makes reliability scalable where brute-force model expansion fails.
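The voting and red-flagging steps described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the MAKER reference implementation: the `sample` callable stands in for one model call on a single atomic step, the red-flag rule (empty or overly long output) is a hypothetical stand-in, and samples are drawn one after another rather than in parallel.

```python
from collections import Counter
from typing import Callable, Optional


def is_red_flagged(output: str, max_chars: int = 200) -> bool:
    """Hypothetical red-flag rule: reject empty or overly long answers."""
    return not output.strip() or len(output) > max_chars


def first_to_ahead_by_k(
    sample: Callable[[], str], k: int = 2, max_samples: int = 50
) -> Optional[str]:
    """Draw answers until one leads every other answer by at least k votes."""
    votes: Counter = Counter()
    for _ in range(max_samples):
        answer = sample()
        if is_red_flagged(answer):
            continue  # discard and resample instead of letting it vote
        votes[answer.strip()] += 1
        ranked = votes.most_common(2)
        leader, leader_votes = ranked[0]
        runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
        if leader_votes - runner_up_votes >= k:
            return leader
    return None  # no consensus within the sampling budget


# Usage with a scripted sampler instead of a real model call:
answers = iter(["A", "B", "A", "x" * 500, "A"])
print(first_to_ahead_by_k(lambda: next(answers), k=2))  # prints "A"
```

In the scripted run, the 500-character answer is red-flagged and never votes, and "A" wins once it is two votes ahead of "B". In a real setup, `sample` would wrap a call to a small local model with only the context needed for that one step.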

Week 51 (2025)

I continued exploring the MAKER approach for various tasks. I saw some success with the gemma3:4b, ministral-3:8b, and gemma3:12b models. I also experimented with the performance of all kinds of models I have available, rating them by speed and accuracy.

During week 51 (2025) (Dec 15 – Dec 21), Ollama processed 332 model sessions with a total runtime of 17 hr, 52 min across 46 different models.
Maker testing December 2025

Week 52 (2025)

While still working with MAKER, I saw great performance with Gemma 3 models and less so with Gemma 2. Even though the Gemma 2 models were my favorites, I really started to love Gemma 3 for both outcome and performance.

During week 52 (2025) (Dec 22 – Dec 28), Ollama processed 705 model sessions with a total runtime of 2 days, 15 hr across 4 different models.

Week 01 (2026)

I started using MAKER for other tasks like generating more accurate meeting summaries and task lists based on transcripts.

Dec 29, 2025 Maker, meeting analyzer.

I created the transcripts using Buzz and Whisper models. This proved to be superior to my own implementations.

During week 01 (2026) (Dec 29 – Jan 4), Ollama processed 750 model sessions with a total runtime of 3 days, 2 hr across 5 different models.

Week 02 (2026)

During week 02 (2026) (Jan 5 – Jan 11), Ollama processed 689 model sessions with a total runtime of 3 days, 5 hr across 4 different models.

I started using gemma3:12b for analyzing weather data on webcam images. 12b seems to be a nice middle ground between efficient processing and good enough results.

Week 03 (2026)

A surprisingly effective new prompting trick can give your AI assistants a serious accuracy boost with almost no extra cost: just repeat the prompt. Recent research shows that duplicating the exact same instruction or input text one or more times can improve large language model performance by up to 76% on tasks like information extraction, retrieval, and classification, because the second copy lets the model “look back” over the first and resolve ambiguities it initially missed. This works particularly well for structured, non‑reasoning workloads—think pulling fields out of messy text, tagging content, or transforming data—where you want maximum reliability without paying for a bigger model or more tokens of chain‑of‑thought reasoning. In practice, applying the technique is trivial: keep your normal prompt, then append the same instructions and/or input again, and you may see immediate gains in correctness while latency and cost remain almost unchanged, since the extra text is processed in the highly parallel prefill phase rather than the slower, autoregressive generation phase. source
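Applied to a plain prompt string, the repetition trick might look like the following. This is a minimal sketch: the helper name `repeat_prompt` and the choice to separate copies with a blank line are assumptions for illustration, not something prescribed by the research mentioned above.

```python
def repeat_prompt(instructions: str, input_text: str, copies: int = 2) -> str:
    """Return the instructions and input text repeated `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    block = f"{instructions}\n\n{input_text}"
    # Each copy is identical; a blank line separates consecutive copies.
    return "\n\n".join([block] * copies)


prompt = repeat_prompt(
    "Extract the start time of the maintenance window.",
    "Maintenance is planned for Tuesday from 22:00 to 23:30.",
)
print(prompt)
```

The resulting string is sent to the model exactly like a normal prompt, so the technique can be switched on or off per workflow to check whether it actually helps for a given task.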

During week 03 (2026) (Jan 12 – Jan 18), Ollama processed 773 model sessions with a total runtime of 2 days, 18 hr across 3 different models.

Week 04 (2026)

During week 04 (2026) (Jan 19 – Jan 25), Ollama processed 779 model sessions with a total runtime of 2 days, 19 hr across 5 different models.

Week 05 (2026)

The best and most trusted model for me is still Gemma 2. I use it for both language correction (27B) and image analysis (12B).

During week 05 (2026) (Jan 26 – Feb 1), Ollama processed 727 model sessions with a total runtime of 2 days, 15 hr across 2 different models.

Week 06 (2026)

This week I tried qwen3-coder-next:q8_0, a new local model. While I loaded it successfully, I wasn’t able to use it effectively. However, I’ve heard that its coding quality is exceptional for a local model.

During week 06 (2026) (Feb 2 – Feb 8), Ollama processed 762 model sessions with a total runtime of 2 days, 17 hr across 4 different models.

Week 07 (2026)

I deleted qwen3-coder-next:q8_0 and replaced it with qwen3-coder-next:q4_K_M. The larger model was too large, so I opted for a more compressed version. However, I haven’t been able to do anything practical with it.

During week 07 (2026) (Feb 9 – Feb 15), Ollama processed 765 model sessions with a total runtime of 2 days, 18 hr across 4 different models.

Week 08 (2026)

Ollama started supporting creating images using the model “x/z-image-turbo:latest,” a significant discovery. The images are not perfect but are great for processing through Ollama.

During week 08 (2026) (Feb 16 – Feb 22), Ollama processed 787 model sessions with a total runtime of 2 days, 20 hr across 3 different models.

Week 09 (2026)

I experimented with “qwen3-coder-next:q4_K_M,” which I can use, but not in a fully integrated workflow or in a generic way. Also, my processing host, the M4 Pro, is not powerful enough for fast answers in this configuration.

During week 09 (2026) (Feb 23 – Mar 1), Ollama processed 769 model sessions with a total runtime of 2 days, 18 hr across 4 different models.

Week 12 (2026)

During week 12 (2026) (Mar 16 – Mar 22), Ollama processed 189 model sessions with a total runtime of 16 hr, 16 min across 1 model.

Week 13 (2026)

During week 13 (2026) (Mar 23 – Mar 29), Ollama processed 764 model sessions with a total runtime of 2 days, 18 hr across 3 different models.

Week 14 (2026)

During week 14 (2026) (Mar 30 – Apr 5), Ollama processed 771 model sessions with a total runtime of 2 days, 19 hr across 5 different models.

Week 15 (2026)

During week 15 (2026) (Apr 6 – Apr 12), Ollama processed 754 model sessions with a total runtime of 2 days, 17 hr across 5 different models.

Week 16 (2026)

The model “x/z-image-turbo:latest” is currently broken and can no longer generate images in Ollama. It is, however, still marked as experimental.

During week 16 (2026) (Apr 13 – Apr 19), Ollama processed 775 model sessions with a total runtime of 2 days, 19 hr across 6 different models.

Week 17 (2026)

During week 17 (2026) (Apr 20 – Apr 26), Ollama processed 842 model sessions with a total runtime of 3 days, 6 hr across 39 different models.

Week 18 (2026)

During week 18 (2026) (Apr 27 – May 3), Ollama processed 766 model sessions with a total runtime of 2 days, 20 hr across 10 different models.

Week 19 (2026)

During week 19 (2026) (May 4 – May 10), Ollama processed 985 model sessions with a total runtime of 3 days, 5 hr across 38 different models.

Week 20 (2026)

During week 20 (2026) (May 11 – May 17), Ollama processed 766 model sessions with a total runtime of 2 days, 16 hr across 16 different models.

Week 21 (2026)

During week 21 (2026) (May 18 – May 24), Ollama processed 752 model sessions with a total runtime of 2 days, 17 hr across 7 different models.

Week 22 (2026)

I switched my primary usage from gemma3:12b to gemma4:e4b. It can also handle webcams, translation, and other workflows.

Funnily enough, the Gemma 12B model has 12.187 billion parameters, with a size of 7.59 GB. The Gemma4:e4b model is not far behind, featuring 7.996 billion parameters, but its size is larger at 8.95 GB. Both models use “Q4_K_M” quantization.

Despite the bigger size, I achieve 55.9 tokens/s with Gemma4:e4b compared to 27.8 tokens/s with Gemma 12B (Ollama on macOS).

It did not work for weather analysis, so I had to revert from gemma4:e4b to gemma3:12b.
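The figures above can be turned into an average number of bits stored per parameter and a speed ratio. This is only a back-of-the-envelope sketch that assumes 1 GB means 10^9 bytes and that the reported sizes cover nothing but model weights; the helper name `bits_per_parameter` is made up for illustration.

```python
def bits_per_parameter(size_gb: float, params_billion: float) -> float:
    """Average stored bits per parameter, assuming 1 GB = 10**9 bytes."""
    # The factors of 10**9 in bytes and in parameters cancel out.
    return size_gb * 8 / params_billion


gemma3_12b_bits = bits_per_parameter(7.59, 12.187)
gemma4_e4b_bits = bits_per_parameter(8.95, 7.996)
speedup = 55.9 / 27.8

print(f"Gemma 12B:  {gemma3_12b_bits:.2f} bits per parameter")   # about 4.98
print(f"Gemma4:e4b: {gemma4_e4b_bits:.2f} bits per parameter")   # about 8.95
print(f"Speedup:    {speedup:.2f}x")                             # about 2.01
```

Under these assumptions the 12B model averages roughly 5 bits per parameter and gemma4:e4b roughly 9, even though both are labeled Q4_K_M, while gemma4:e4b generates tokens about twice as fast.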

During week 22 (2026) (May 25 – May 31), Ollama processed 744 model sessions with a total runtime of 3 days, 2 hr across 3 different models.

Week 23 (2026)

I’m considering stopping this usage tracking at the source. Unfortunately, Ollama still doesn’t have great means to track usage, so my makeshift tool is cool but not perfect. For various applications, I have integrated client-side tracking, but it focuses less on tokens or models and more on whether it’s working.

Bar chart showing AI call health over 30 days, with counts of successes, failures, and fallbacks for OpenAI and Ollama. Overall status marked as healthy.

This is an application where I added Ollama as a fallback for when OpenAI cannot process data. This could happen during an outage, but in my case, it also happens when financial limits are reached and there is no more quota to use.

Quota limits are an actual issue that can disrupt your multi-year automated processes. Furthermore, if your OpenAI account is poorly managed, this could also affect you.
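A fallback with simple client-side health counters, as described above, can be sketched like this. It is a minimal illustration under assumptions: `primary` and `fallback` are hypothetical callables that wrap the real OpenAI and Ollama clients, and any exception from the primary (outage, exhausted quota) triggers the fallback.

```python
from collections import Counter
from typing import Callable


def call_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    stats: Counter,
) -> str:
    """Try the primary provider first and use the fallback if it raises."""
    try:
        result = primary(prompt)
        stats["primary_success"] += 1
        return result
    except Exception:
        # Outages and exhausted quotas both end up here.
        stats["primary_failure"] += 1
    result = fallback(prompt)  # an error here propagates to the caller
    stats["fallback_success"] += 1
    return result


# Usage with stand-in functions instead of real API clients:
def quota_exceeded(prompt: str) -> str:
    raise RuntimeError("quota exceeded")


stats: Counter = Counter()
answer = call_with_fallback("ping", quota_exceeded, lambda p: f"local: {p}", stats)
print(answer)        # local: ping
print(dict(stats))   # {'primary_failure': 1, 'fallback_success': 1}
```

Counters like these are enough to draw a success, failure, and fallback chart per provider without tracking tokens or models.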

During week 23 (2026) (Jun 1 – Jun 7), Ollama processed 764 model sessions with a total runtime of 3 days, 12 hr across 4 different models.

Week 24 (2026)

Using Ollama as a fallback for gpt-4o-mini was not very successful: the excessive number of requests hindered the process. So I implemented another fallback using the Azure Foundry model router. This also caused issues, as within one day it cost more than 1 CHF. While this amount is not substantial, it represents a huge increase in cost considering we were spending around $0.50 per month. To address this, I deployed the Azure Foundry model o4-mini-2025-04-16 to try to lower the costs. This experience was an interesting lesson in reliability engineering and cost management all in one. Having a stable subscription, or rather preventing the spending limit from being repeatedly crossed, is now a significant struggle for operating productive workloads. Even though this process ran very well for over two years, the instability caused by noise from other, less carefully managed projects on the same budget remains a major problem. Once a workflow is no longer dependable, it becomes more of a nuisance than an asset.

During week 24 (2026) (Jun 8 – Jun 14), Ollama processed 725 model sessions with a total runtime of 3 days, 15 hr across 23 different models.