Transcribe audio with AI locally

I’m collecting skills in local AI processing, and today it’s audio-to-text. I switched to Ollama for most of my AI processing needs, but audio files can’t yet be processed via the Ollama APIs, so I looked for an alternative. I didn’t need a REST API; running a Docker container is sufficient for me.

I used Whisper in Docker.

The whole workflow is handled in a PowerShell script: we download the multilingual language models (medium and large-v3) and use them to process the audio files.

# Create main directory and subdirectories
$rootDir = "D:\whisper-ai"
$modelsDir = "$rootDir\models"
$testdataDir = "$rootDir\testdata"

# Create directories if they don't exist
New-Item -ItemType Directory -Force -Path $rootDir
New-Item -ItemType Directory -Force -Path $modelsDir
New-Item -ItemType Directory -Force -Path $testdataDir

# Function to download model if it doesn't exist
function Download-WhisperModel {
    param (
        [string]$modelName,
        [string]$outputPath
    )
    
    if (-not (Test-Path $outputPath)) {
        Write-Host "Downloading $modelName model..."
        $url = "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-$modelName.bin"
        Invoke-WebRequest -Uri $url -OutFile $outputPath
        Write-Host "$modelName model downloaded successfully"
    } else {
        Write-Host "$modelName model already exists"
    }
}

# Download medium and large models
Download-WhisperModel -modelName "medium" -outputPath "$modelsDir\ggml-medium.bin"
Download-WhisperModel -modelName "large-v3" -outputPath "$modelsDir\ggml-large-v3.bin"

# Function to transcribe audio
function Transcribe-Audio {
    param (
        [string]$modelPath,
        [string]$audioPath
    )
    
    # Remove existing container if it exists
    docker rm -f whisper-transcribe 2>&1 | Out-Null
    
    # Run docker with output handling
    $result = docker run `
        --name whisper-transcribe `
        -v ${modelsDir}:/app/models `
        -v ${testdataDir}:/app/testdata `
        ghcr.io/appleboy/go-whisper:latest `
        --model $modelPath `
        --audio-path $audioPath 2>&1

    # Output the result as regular output
    $result | ForEach-Object {
        Write-Host $_
    }
}

Write-Host "Setup complete! To transcribe an audio file, follow these steps:"
Write-Host "1. Place your audio file in: $testdataDir"
Write-Host "2. Run one of these commands (replace 'your-audio.wav' with your actual filename):"
Write-Host "`nFor medium model:"
Write-Host "Transcribe-Audio -modelPath '/app/models/ggml-medium.bin' -audioPath '/app/testdata/your-audio.wav'"
Write-Host "`nFor large model:"
Write-Host "Transcribe-Audio -modelPath '/app/models/ggml-large-v3.bin' -audioPath '/app/testdata/your-audio.wav'"

# Example usage: transcribe with the large model
Transcribe-Audio -modelPath "/app/models/ggml-large-v3.bin" -audioPath "/app/testdata/your-audio.m4a"
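If transcription fails on a compressed format such as .m4a, converting the file to 16 kHz mono WAV first may help, since whisper.cpp works on 16-bit PCM WAV input (whether the go-whisper image converts other formats automatically is something I haven’t verified):

```powershell
# Convert m4a to 16 kHz mono 16-bit PCM WAV (requires ffmpeg on the PATH)
ffmpeg -i "$testdataDir\your-audio.m4a" -ar 16000 -ac 1 -c:a pcm_s16le "$testdataDir\your-audio.wav"
```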

One more thing: the files are usually rather large, and the processing time is even longer. With a remote API, long-running jobs would require a staged, asynchronous workflow: submit the job, get a job ID back, check the job state by ID, and then retrieve the transcript by ID once it’s finished.
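As a sketch, such a staged workflow against a hypothetical REST API (the endpoint, JSON fields, and state names are all assumptions, not a real service) could look like this:

```powershell
# Hypothetical asynchronous transcription API (URL and fields are assumptions)
$baseUrl = "https://api.example.com/transcriptions"

# 1. Submit the audio file and receive a job ID
$job = Invoke-RestMethod -Method Post -Uri $baseUrl -InFile "your-audio.wav" -ContentType "audio/wav"

# 2. Poll the job state by ID until it is finished
do {
    Start-Sleep -Seconds 30
    $status = Invoke-RestMethod -Uri "$baseUrl/$($job.id)"
} while ($status.state -ne "finished")

# 3. Retrieve the transcript by ID
$transcript = Invoke-RestMethod -Uri "$baseUrl/$($job.id)/result"
```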

All of that isn’t needed with this kind of local processing.

Experiments with audio-to-text using the model ggml-large-v3.bin (2025-01-16)

One of the strangest things I saw: in a test, an audio file recorded from a movie on TV produced the line “Untertitelung des ZDF, 2020” (“Subtitling by ZDF, 2020”). This was not part of the recording, nor had the recording anything to do with ZDF. It looks like the model hallucinated a subtitle credit it picked up from its training data.

Problem with long transcripts

I recently tried to transcribe a one-hour-long file and failed. The CPU load stayed high for multiple hours, and I stopped the run after around four hours using the medium model. I don’t know what went wrong, but I wanted to report the issue. I’m not sure whether longer recordings are supposed to be split up beforehand.
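As a workaround (untested in this setup), the audio could be split into fixed-length chunks with ffmpeg and each chunk transcribed separately:

```powershell
# Split the recording into 10-minute chunks (requires ffmpeg on the PATH)
ffmpeg -i "$testdataDir\long-audio.wav" -f segment -segment_time 600 -c copy "$testdataDir\chunk-%03d.wav"

# Transcribe each chunk with the medium model
Get-ChildItem "$testdataDir\chunk-*.wav" | ForEach-Object {
    Transcribe-Audio -modelPath "/app/models/ggml-medium.bin" -audioPath "/app/testdata/$($_.Name)"
}
```

The individual transcripts would then need to be concatenated in order; sentences spanning a chunk boundary may be cut.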