What I learned about Ollama and Models

While I’m aware these models aren’t limited or bound to Ollama, Ollama is still how I interface with and use them. Here I keep notes on how I use certain models and what I like or dislike about them.

This is a purely subjective view. There are many attempts at objectively measuring the performance of models; this is not one of them.

| Model | VRAM Size | Quantization | Remarks from my personal experience |
|---|---|---|---|
| gemma3:27b | 20627.03 MB | Q4_K_M | 📕 When using Gemma 3 to correct my English, I see that the text it creates, while grammatically correct, no longer means the same thing. I therefore prefer Gemma 2 for correcting English, as it preserves the content and writing style better. 👁‍🗨 Supports vision |
| gemma2:27b | 19903.46 MB | Q4_0 | 📕 Very good at fixing English or German texts. ⭐ This is my most used and favorite model at the moment. |
| mannix/gemma2-9b-simpo:latest | 9080.95 MB | Q4_0 | 📕 Very good at translating into English and at fixing English. |
| qwen3:32b | 24773.64 MB | Q4_K_M | |
| qwen3:30b-a3b | 20293.31 MB | Q4_K_M | For fast answers. |
| deepseek-r1:32b | 22385.44 MB | Q4_K_M | |
| mistral:7b | 6095.05 MB | Q4_0 | 📘 An efficient, cheap model that stays truer to the source than others when summarizing and processing time is an issue. I summarized over 300,000 text samples in around 400 hours. |
| falcon3:10b | 8176.71 MB | | |
| llama3.1:latest | 6942.47 MB | Q4_K_M | |
| llama3.2:latest | 4090.59 MB | Q4_K_M | |
| llama3.2-vision:latest | 11696.27 MB | Q4_K_M | 👁‍🗨 Supports vision |
| devstral:latest | 16035.07 MB | Q4_K_M | Devstral is built in a collaboration between Mistral AI and All Hands AI 🙌 It is light enough to run on a single RTX 4090 or a Mac with 32 GB RAM, making it an ideal choice for local deployment and on-device use. |
| Osmosis/Osmosis-Structure-0.6B:latest | 2905.68 MB | unknown | A specialized small language model (SLM) designed to excel at structured output generation. |
| bge-m3 | 1690.55 MB | F16 | 🎯 A multi-language embedding model and my model of choice for RAG; I usually combine it with Qdrant. ⭐ This is my most used and favorite embedding model at the moment. |
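For the bulk summarization jobs mentioned in the table, a minimal sketch of what a single call against Ollama's /api/generate endpoint might look like. The model name matches the table above; the prompt wording and three-sentence limit are my own illustrative choices, not the author's actual prompt.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def build_summary_request(text: str, model: str = "mistral:7b") -> dict:
    # One non-streaming generate request; a short, explicit instruction
    # helps a 7B model stay close to the source text.
    return {
        "model": model,
        "prompt": f"Summarize the following text in three sentences:\n\n{text}",
        "stream": False,
    }


def summarize(text: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_summary_request(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Looping this over a few hundred thousand samples is where the size difference between mistral:7b and a 27B model really shows up in wall-clock time.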

Reflection on what I run on Ollama

In this section, I reflect on my actual usage. Read more about these graphs.

Week 18 (2025)

I’m a user of all kinds of AI, including OpenAI ChatGPT and Anthropic Claude, and potentially soon the Grok API via Azure. But some data I process locally, both because I don’t want to send it to those companies’ APIs and as part of my personal learning journey. This week we see heavy usage of Mistral:7B. I like to use Mistral:7B for summarization when power consumption is an issue. As I was processing a hundred thousand entries, I chose Mistral:7B over Gemma2:27B because it keeps accuracy reasonably high while processing in acceptable time on my Mac.

During week 18 (2025) (Apr 28 – May 4), Ollama processed 11 model sessions with a total runtime of 7 hr, 4 min across 8 different models.

Week 19 (2025)

I’m a huge fan of gemma2:27b and was excited by the release of gemma3:27b. However, after testing and comparing both models, I still prefer gemma2:27b: in my summarization and translation workflows it produced results closer to what I expected. Gemma3:27b made a lot more “creative” changes and even changed the context in unintended ways.

During week 19 (2025) (May 5 – May 11), Ollama processed 58 model sessions with a total runtime of 4 hr, 37 min across 8 different models.

Week 20 (2025)

I more or less abandoned my summarization and translation workflow attempts with gemma3:27b and switched back to gemma2:27b, which is a kickass model. I also updated the Qdrant vector database with new content; my go-to embedding model there is bge-m3:latest.
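A minimal sketch of what such a vector-database update might look like: embeddings from Ollama's /api/embed endpoint with bge-m3, shaped into the body that Qdrant's REST upsert endpoint (`PUT /collections/<name>/points`) expects. The collection name and payload fields here are illustrative assumptions, not the author's actual setup.

```python
import json
import urllib.request


def embed(texts, model="bge-m3", url="http://localhost:11434/api/embed"):
    # Ollama's /api/embed returns one embedding vector per input string.
    req = urllib.request.Request(
        url,
        data=json.dumps({"model": model, "input": texts}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embeddings"]


def to_points(ids, vectors, payloads):
    # Shape matches Qdrant's REST upsert body:
    # PUT /collections/<collection>/points  with  {"points": [...]}
    return {
        "points": [
            {"id": i, "vector": v, "payload": p}
            for i, v, p in zip(ids, vectors, payloads)
        ]
    }
```

The same `embed` call is then reused at query time, so the stored vectors and the search vectors come from the same model.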

I was also exploring the new model qwen3:30b-a3b but couldn’t find a use case for it. Still, it’s always good to see and try new models.

During week 20 (2025) (May 12 – May 18), Ollama processed 63 model sessions with a total runtime of 1 day, 9 hr across 5 different models.

Week 21 (2025)

This was more or less a regular week with large summarization jobs running on my still favorite model, Gemma 2:27B.

During week 21 (2025) (May 19 – May 25), Ollama processed 34 model sessions with a total runtime of 1 day across 10 different models.

Week 22 (2025)

We see two new models on the list. I was playing with devstral:latest, a derivative of Mistral tailored for coding. I used it in conjunction with AllHands, which provides a chat interface for building your applications. This was my best outcome to date for code generation with local, limited processing power.

In general, however, we see that my regular summarization and translation workflow is still processing the bulk of the time using Gemma2.

During week 22 (2025) (May 26 – Jun 1), Ollama processed 31 model sessions with a total runtime of 16 hr, 28 min across 4 different models.

Week 23 (2025)

This was not a good week for my Mac. After a power outage caused by a storm, I lost access to it for about five days. Therefore, fewer workloads were processed this week, and nothing new could be tested. I was forced to use low-power Gemma models and found that they are not up to the task: below 12B, Gemma is pretty much unusable for translation and summarization in my case. I’m happy my Mac is working again.

During week 23 (2025) (Jun 2 – Jun 8), Ollama processed 7 model sessions with a total runtime of 1 day, 15 hr across 2 different models.

Week 24 (2025)

Just a very regular week, though I also compared some Gemma 3 models again. They still can’t convince me.

During week 24 (2025) (Jun 9 – Jun 15), Ollama processed 39 model sessions with a total runtime of 4 hr, 59 min across 3 different models.

Week 25 (2025)

I did some experimentation in understanding tasks from tickets using reasoning models. My favorite for local hosting remains deepseek-r1:32b.

During week 25 (2025) (Jun 16 – Jun 22), Ollama processed 88 model sessions with a total runtime of 7 hr, 6 min across 11 different models.

Week 26 (2025)

Just a boring week.

During week 26 (2025) (Jun 23 – Jun 29), Ollama processed 52 model sessions with a total runtime of 15 hr, 58 min across 2 different models.

Week 27 (2025)

I had neither the time nor the need to use an LLM this week.

During week 27 (2025) (Jun 30 – Jul 6), Ollama processed 2 model sessions with a total runtime of 10 min, 32 sec across 2 different models.

Week 28 (2025)

I had neither the time nor the need to use an LLM this week.

During week 28 (2025) (Jul 7 – Jul 13), Ollama processed 1 model session with a total runtime of 5 min, 20 sec across 1 model.

Week 29 (2025)

After the quiet two weeks, I reran summarization and translation workflows to catch up on missed updates.

During week 29 (2025) (Jul 14 – Jul 20), Ollama processed 12 model sessions with a total runtime of 2 days, 1 hr across 1 model.

Week 30 (2025)

It was a rather quiet week. I accumulated a large backlog of translations and summaries and did not have enough time to work through it. I also read about the newly released Mistral-Small 3.2:24B but have not found a use case for it yet.

During week 30 (2025) (Jul 21 – Jul 27), Ollama processed 43 model sessions with a total runtime of 1 day, 12 hr across 2 different models.

Week 31 (2025)

I went on the hunt for a more efficient model for my summarization and translation use case. I used LM Arena to find users’ favorites among lower-parameter models. My current favorite model is gemma2:27b; now I’m switching to gemma2-9b-simpo in the hope of getting similar results with less power.

https://lmarena.ai/leaderboard/text (2025-07-28)

This is the model with which I’m trying to replace my high-volume workflows, where I previously used Mistral:7B and then tried Gemma 2:27B, which just took too long. Now I hope Gemma 2-9B SimPO will solve these issues for me.

One benefit I noticed already is that when I send occasional competing work packages to the Ollama API, it’s less disruptive for those queries, as Ollama finds a slot for them faster between my bulk translation and summarization runs. Therefore, my other low-volume processes are less likely to time out.
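To keep a low-volume query from hanging behind a bulk job, a client-side timeout plus a simple retry helps. This is a generic sketch of that pattern against Ollama's /api/generate, not the author's actual client; the timeout and backoff values are arbitrary assumptions.

```python
import json
import time
import urllib.error
import urllib.request


def generate_with_retry(payload: dict,
                        url: str = "http://localhost:11434/api/generate",
                        timeout: float = 30.0,
                        retries: int = 3,
                        backoff: float = 5.0) -> str:
    # Fail fast on a busy server and try again shortly, instead of
    # blocking indefinitely behind a long-running bulk request.
    data = json.dumps(payload).encode()
    for attempt in range(retries):
        try:
            req = urllib.request.Request(
                url, data=data,
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return json.loads(resp.read())["response"]
        except (urllib.error.URLError, TimeoutError):
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(backoff)
```

With short timeouts, the occasional query competes for a slot repeatedly rather than holding a connection open for the whole wait.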

During week 31 (2025) (Jul 28 – Aug 3), Ollama processed 42 model sessions with a total runtime of 3 days, 5 hr across 2 different models.

Week 32 (2025)

No experiments this week; the regular summarization and translation workflows continue using the new model.

During week 32 (2025) (Aug 4 – Aug 10), Ollama processed 30 model sessions with a total runtime of 18 hr, 27 min across 4 different models.

Week 33 (2025)

OpenAI dropped an open model, “gpt-oss.” I went out to get the 20B-parameter version. They describe it as “OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases,” and I will try it out.

Week 50 (2025)

During week 50 (2025) (Dec 8 – Dec 14), Ollama processed 31 model sessions with a total runtime of 3 hr, 12 min across 4 different models.

I started experimenting with MAKER. This is a very interesting approach that I’m using to analyze unstructured maintenance windows arriving as free text.

MAKER achieves reliability through decomposition, not scale. Rather than depending on larger models, it orchestrates networks of smaller ones using three interlocking principles:

Maximal Agentic Decomposition (MAD): Tasks are fragmented into their smallest meaningful units, often one decision per agent. Each agent receives only the context essential for its step. This isolation prevents error cascades, contains context drift, and makes corrections surgically precise.

First-to-ahead-by-k voting: Multiple agents tackle identical steps simultaneously. The system locks in whichever action first pulls k votes ahead of alternatives, creating fast local consensus. Modest per-step accuracy gains compound exponentially across thousands of steps, converting local agreement into system-wide dependability.
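The first-to-ahead-by-k rule can be sketched as a small sampling loop. Here `sample_step` stands in for one model call producing a candidate action; this is a toy illustration of the voting rule described above, not MAKER's actual implementation.

```python
from collections import Counter


def first_to_ahead_by_k(sample_step, k: int = 2, max_samples: int = 100):
    """Draw candidate actions until one is k votes ahead of the runner-up."""
    votes = Counter()
    for _ in range(max_samples):
        votes[sample_step()] += 1  # one fresh model sample per iteration
        ranked = votes.most_common(2)
        leader, lead_count = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if lead_count - runner_up >= k:
            return leader  # local consensus reached
    # Safety valve: fall back to plurality if no k-lead emerges.
    return votes.most_common(1)[0][0]
```

Because agreement is decided per atomic step, a modest per-sample accuracy edge is enough for the leader to pull ahead quickly in the common case.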

Red-flagging: Certain outputs betray confusion through telltale patterns: excessive verbosity, malformed structure. MAKER preemptively filters these before they enter the vote pool, then resamples. This cuts both raw errors and the correlated failures that might otherwise ripple through the chain.
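A red-flag filter can be as simple as a couple of cheap checks run before an output is allowed to vote. The two heuristics below (length cap, JSON well-formedness) are my own illustrative stand-ins for the patterns the paragraph above mentions; the threshold is arbitrary.

```python
import json


def red_flagged(output: str, max_len: int = 400) -> bool:
    # Heuristic 1: excessive verbosity suggests a confused agent.
    if len(output) > max_len:
        return True
    # Heuristic 2: malformed structure (here: invalid JSON) is rejected.
    try:
        json.loads(output)
    except ValueError:
        return True
    return False


def filtered_votes(outputs):
    # Only concise, well-formed outputs enter the vote pool;
    # in MAKER, rejected outputs would be resampled instead of counted.
    return [o for o in outputs if not red_flagged(o)]
```

Filtering before voting matters because a systematic failure mode would otherwise show up in many samples at once and could win the vote.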

Together, these mechanisms unlock predictable scaling laws. Vote requirements grow only logarithmically with step count; costs scale roughly linearly. Traditional approaches, where agents handle multi-step chunks, hit exponential cost walls. The insight: atomic decomposition makes reliability scalable where brute-force model expansion fails.

Week 51 (2025)

I continued exploring the MAKER model for various tasks. I saw some success with gemma3:4b, ministral-3:8b, and gemma3:12b models. I also experimented with the performance of all kinds of models I have available, rating them by speed and accuracy.

During week 51 (2025) (Dec 15 – Dec 21), Ollama processed 332 model sessions with a total runtime of 17 hr, 52 min across 46 different models.
Maker testing December 2025

Week 52 (2025)

While still working with MAKER, I saw great performance with Gemma 3 models and less so with Gemma 2. Even though Gemma 2 models had been my favorites, I have really started to love Gemma 3, both for outcome and performance.

During week 52 (2025) (Dec 22 – Dec 28), Ollama processed 705 model sessions with a total runtime of 2 days, 15 hr across 4 different models.

Week 01 (2026)

I started using MAKER for other tasks like generating more accurate meeting summaries and task lists based on transcripts.

Dec 29, 2025: Maker, meeting analyzer.

I built it using Buzz and Whisper models. This proved to be a superior method to my own implementations.

During week 01 (2026) (Dec 29 – Jan 4), Ollama processed 750 model sessions with a total runtime of 3 days, 2 hr across 5 different models.

Week 02 (2026)

During week 02 (2026) (Jan 5 – Jan 11), Ollama processed 689 model sessions with a total runtime of 3 days, 5 hr across 4 different models.

I started using gemma3:12b for analyzing weather data in webcam images. 12B seems to be a nice middle ground between efficient processing and good-enough results.
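A minimal sketch of how such a vision request might be built: Ollama's /api/chat accepts base64-encoded images attached to a message. The question text is my own illustrative choice, not the author's actual prompt.

```python
import base64
import json
import urllib.request


def build_weather_request(image_path: str, model: str = "gemma3:12b") -> dict:
    # Ollama expects images as base64 strings in the message's "images" list.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": "Describe the current weather visible in this webcam image.",
            "images": [image_b64],
        }],
        "stream": False,
    }


def analyze(image_path: str, url: str = "http://localhost:11434/api/chat") -> str:
    req = urllib.request.Request(
        url,
        data=json.dumps(build_weather_request(image_path)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Constraining the expected answer (e.g. asking for a fixed set of weather labels) makes the output easier to store alongside the image timestamp.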

Week 03 (2026)

A surprisingly effective new prompting trick can give your AI assistants a serious accuracy boost at almost no extra cost: just repeat the prompt. Recent research shows that duplicating the exact same instruction or input text one or more times can improve large language model performance by up to 76% on tasks like information extraction, retrieval, and classification, because the second copy lets the model “look back” over the first and resolve ambiguities it initially missed.

This works particularly well for structured, non-reasoning workloads (think pulling fields out of messy text, tagging content, or transforming data) where you want maximum reliability without paying for a bigger model or more tokens of chain-of-thought reasoning. In practice, applying the technique is trivial: keep your normal prompt, then append the same instructions and/or input again. You may see immediate gains in correctness while latency and cost remain almost unchanged, since the extra text is processed in the highly parallel prefill phase rather than the slower, autoregressive generation phase. source
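The trick described above is literally just a string operation applied before the request is sent. The separator between copies is my own assumption; the research only requires that the full prompt appear more than once.

```python
def repeat_prompt(prompt: str, copies: int = 2, sep: str = "\n\n") -> str:
    # Duplicate the full instruction + input so the model can "look back"
    # over the first copy while reading the second during prefill.
    return sep.join([prompt] * copies)
```

For example, `repeat_prompt("Extract all dates from: " + text)` would be sent as the prompt field of a normal Ollama generate request in place of the single copy.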