While I’m aware these models aren’t limited or bound to Ollama, Ollama is still how I interface with and use them. Here I keep notes on how I use certain models and what I like or dislike about them.
This is a purely subjective view; there are many attempts at objectively measuring the performance of models, but this is not one of them.
| Model | VRAM Size | Quantization | Remarks from my personal experience |
|---|---|---|---|
| gemma3:27b | 20627.03 MB | Q4_K_M | 📕 When using Gemma 3 to correct my English, the text it produces, while grammatically correct, no longer means the same thing. Therefore, I prefer Gemma 2 for correcting English, as it does the job well and better preserves the content and writing style. 👁🗨 Supports vision |
| gemma2:27b | 19903.46 MB | Q4_0 | 📕 Very good in fixing English or German texts. ⭐This is my most used and favorite model at the moment. |
| mannix/gemma2-9b-simpo:latest | 9080.95 MB | Q4_0 | 📕 Very good at translating into English and fixing English. |
| qwen3:32b | 24773.64 MB | Q4_K_M | |
| qwen3:30b-a3b | 20293.31 MB | Q4_K_M | For fast answers |
| deepseek-r1:32b | 22385.44 MB | Q4_K_M | |
| mistral:7b | 6095.05 MB | Q4_0 | 📘 An efficient and cheap model that stays truer to the source than others when summarizing texts and processing time is an issue. I summarized over 300’000 text samples in around 400 hours. |
| falcon3:10b | 8176.71 MB | unknown | |
| llama3.1:latest | 6942.47 MB | Q4_K_M | |
| llama3.2:latest | 4090.59 MB | Q4_K_M | |
| llama3.2-vision:latest | 11696.27 MB | Q4_K_M | 👁🗨 Supports vision |
| devstral:latest | 16035.07 MB | Q4_K_M | Devstral was built in collaboration between Mistral AI and All Hands AI 🙌 Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM, making it an ideal choice for local deployment and on-device use. |
| Osmosis/Osmosis-Structure-0.6B:latest | 2905.68 MB | unknown | A specialized small language model (SLM) designed to excel at structured output generation. |
| bge-m3 | 1690.55 MB | F16 | 🎯 A multi-language embedding model and my model of choice for RAG. I usually combine it with Qdrant. ⭐This is my most used and favorite embedding model at the moment. |
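As a rough illustration of what the vector search in my RAG does conceptually: a toy sketch with hypothetical vectors and documents (in my real setup, bge-m3 produces the embeddings and Qdrant handles storage and similarity search).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k document vectors most similar
    to the query vector (what a vector DB like Qdrant does at scale)."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

In practice the vectors come from the embedding model and the search runs inside Qdrant; this sketch only shows the ranking idea.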
Reflection on what I run on Ollama
In this section, I reflect on my actual usage. Read more about these graphs.
Week 18 (2025)
I’m a user of all kinds of AI, including OpenAI ChatGPT and Anthropic Claude, and potentially soon also the Grok API via Azure. But some data I process locally, because I don’t want to send it to those companies’ APIs, and also for my personal learning journey. This week we see large usage of Mistral:7B. I like to use Mistral:7B for summarization when power consumption is an issue. As I was processing a hundred thousand entries, I chose Mistral:7B over Gemma2:27B to keep accuracy reasonably high while still processing in OK time on my Mac.

Week 19 (2025)
I’m a huge fan of gemma2:27b and was excited about the release of gemma3:27b. However, after testing and comparing both models, I still prefer gemma2:27b, because in my summarization and translation workflows it proved to create results closer to what I expected. Gemma3:27b made a lot more “creative” changes and even altered the context in unintended ways.

Week 20 (2025)
I more or less abandoned my summarization and translation workflow attempts with gemma3:27b and switched back to gemma2:27b, which is a kickass model. I also updated the Qdrant vector database with new content, where my go-to model is bge-m3:latest.
I was also exploring the new model qwen3:30b-a3b, but couldn’t find a use case for it. Still, it’s always good to see and try new models.

Week 21 (2025)
This was more or less a regular week with large summarization jobs running on my still favorite model, Gemma 2:27B.

Week 22 (2025)
We see two new models on the list. I was playing with devstral:latest, a derivative of Mistral tailored for coding. I used it in conjunction with AllHands, which provides a chat interface for building your applications. This was my best outcome to date for code generation with local, limited processing power.
In general, however, we see that my regular summarization and translation workflow is still processing the bulk of the time using Gemma2.

Week 23 (2025)
This was not a good week for my Mac. After a power outage from a storm, I lost access for about five days. Therefore, fewer workloads were processed this week, and nothing new could be tested. I was forced to use low-power Gemma 2 models and found that they are not up to the task. Anything below gemma:12b is pretty much unusable for translation and summarization in my case. I’m happy my Mac is now working again.

Week 24 (2025)
Just a very regular week, though I compared some Gemma models again. They still can’t convince me.

Week 25 (2025)
I did some experimentation in understanding tasks from tickets using reasoning models. My favorite for local hosting remains deepseek-r1:32b.

Week 26 (2025)
Just a boring week.

Week 27 (2025)
I didn’t have time or a need to use an LLM this week.

Week 28 (2025)
I didn’t have time or a need to use an LLM this week.

Week 29 (2025)
After the two quiet weeks, I reran summarization and translation workflows to catch up on missed updates.

Week 30 (2025)
It was a rather quiet week. I accumulated a large backlog of translations and summaries and did not have enough time to work it down. I also read about the newly released Mistral-Small 3.2:24B but have not found a use case for it yet.

Week 31 (2025)
I went on the hunt for a more efficient model for my use case of summarization and translation. I used LM Arena to determine users’ favorites at a lower parameter size. My current favorite model is gemma2:27b; now I’m switching to gemma2-9b-simpo in the hope of getting similar results with less power.

With this model I’m trying to replace my high-volume workflows, where I previously used Mistral:7B and then tried Gemma2:27B, which just took too long. Now I hope gemma2-9b-simpo will solve these issues for me.
One benefit I noticed already: when I send occasional competing work packages to the Ollama API, these queries are less disruptive, because Ollama finds a slot for them faster between my bulk translation and summarization runs. Therefore, my other low-volume processes are less likely to time out.

Week 32 (2025)
No experiments this week; the regular summarization and translation workflows continue using the new model.

Week 33 (2025)
OpenAI dropped an open model “gpt-oss.” I went out to get it in the 20B parameter version. They describe it as “OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases,” and I will try it out.

Week 50 (2025)
I started implementing with MAKER. This is a very interesting approach that I started using to analyze unstructured maintenance windows arriving as free text.
MAKER achieves reliability through decomposition, not scale. Rather than depending on larger models, it orchestrates networks of smaller ones using three interlocking principles:
Maximal Agentic Decomposition (MAD): Tasks fragment into their smallest meaningful units, often one decision per agent. Each receives only the essential context for its step. This isolation prevents error cascades, contains context drift, and makes corrections surgically precise.
First-to-ahead-by-k voting: Multiple agents tackle identical steps simultaneously. The system locks in whichever action first pulls k votes ahead of alternatives, creating fast local consensus. Modest per-step accuracy gains compound exponentially across thousands of steps, converting local agreement into system-wide dependability.
Red-flagging: Certain outputs betray confusion through telltale patterns: excessive verbosity, malformed structure. MAKER preemptively filters these before they enter the vote pool, then resamples. This cuts both raw errors and correlated failures that might otherwise ripple through the chain.
Together, these mechanisms unlock predictable scaling laws. Vote requirements grow only logarithmically with step count; costs scale roughly linearly. Traditional approaches where agents handle multi-step chunks hit exponential cost walls. The insight: atomic decomposition makes reliability scalable where brute-force model expansion fails.
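To make the voting concrete, here is a minimal sketch of what first-to-ahead-by-k voting combined with red-flagging could look like. This is my own illustration, not MAKER’s actual implementation; `sample_agent` is a hypothetical stand-in for one micro-agent invocation.

```python
def first_to_ahead_by_k(sample_agent, k=2, max_samples=100, red_flag=None):
    """Sample candidate actions until one leads all others by k votes.

    sample_agent: callable returning one candidate action per call
                  (in MAKER, one micro-agent invocation).
    red_flag:     optional predicate; flagged outputs are discarded
                  and resampled instead of entering the vote pool.
    """
    counts = {}
    for _ in range(max_samples):
        action = sample_agent()
        if red_flag is not None and red_flag(action):
            continue  # red-flagged output never enters the vote pool
        counts[action] = counts.get(action, 0) + 1
        leader, votes = max(counts.items(), key=lambda kv: kv[1])
        runner_up = max((v for a, v in counts.items() if a != leader), default=0)
        if votes - runner_up >= k:
            return leader  # fast local consensus reached
    raise RuntimeError("no consensus within max_samples")

# Example: "A" pulls k=2 ahead after two clean samples.
votes = iter(["A", "A", "B", "A"]).__next__
first_to_ahead_by_k(votes, k=2)  # returns "A"
```

The key property this sketches is that consensus is local and cheap: each step only needs enough samples for one option to pull k votes ahead, which is why the vote requirement grows slowly with step count.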
Week 51 (2025)
I continued exploring the MAKER model for various tasks. I saw some success with gemma3:4b, ministral-3:8b, and gemma3:12b models. I also experimented with the performance of all kinds of models I have available, rating them by speed and accuracy.
Week 52 (2025)
While still working with MAKER, I saw great performance with Gemma 3 models and less so with Gemma 2. Even though Gemma 2 models were my favorites, I really started to love Gemma 3, both for outcome and performance.

Week 01 (2026)
I started using MAKER for other tasks like generating more accurate meeting summaries and task lists based on transcripts.

I created the transcripts using Buzz and Whisper models. This proved to be a superior method to my own implementations.

Week 02 (2026)

I started using gemma3:12b for analyzing weather data on webcam images. 12B seems to be a nice middle ground between efficient processing and good enough results.

Week 03 (2026)
A surprisingly effective new prompting trick can give your AI assistants a serious accuracy boost at almost no extra cost: just repeat the prompt. Recent research shows that duplicating the exact same instruction or input text one or more times can improve large language model performance by up to 76% on tasks like information extraction, retrieval, and classification, because the second copy lets the model “look back” over the first and resolve ambiguities it initially missed. This works particularly well for structured, non-reasoning workloads (think pulling fields out of messy text, tagging content, or transforming data) where you want maximum reliability without paying for a bigger model or more tokens of chain-of-thought reasoning. In practice, applying the technique is trivial: keep your normal prompt, then append the same instructions and/or input again. You may see immediate gains in correctness while latency and cost remain almost unchanged, since the extra text is processed in the highly parallel prefill phase rather than the slower, autoregressive generation phase. (source)
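Wiring this into a workflow is as simple as it sounds. A minimal sketch (the separator, repeat count, and example instruction are my own illustrative choices, not from the research):

```python
def repeat_prompt(prompt: str, times: int = 2, sep: str = "\n\n") -> str:
    """Duplicate the full prompt so the model can 'look back' over
    the first copy while processing the second."""
    return sep.join([prompt] * times)

# The doubled prompt is then sent to the model as usual (e.g. via the
# Ollama API); only the prompt construction changes.
instruction = "Extract all dates from the text below.\nText: Meeting on 2026-01-15."
doubled = repeat_prompt(instruction)
```

Since the duplication happens entirely on the client side, it works with any model and any serving stack; nothing about the inference setup needs to change.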
