Running On-Demand GPU Inference in Europe With Terraform and Cloud Run
A walkthrough of how we set up serverless GPU infrastructure for speech-to-text and speaker diarization: what the setup looks like, what the costs are, and what didn't work on the first try.
Apr 10, 2026
We run several AI models in production for our clients: Whisper for speech-to-text, a speaker diarization service, and a multi-step inference pipeline. They process voice messages (mostly from WhatsApp) for businesses in Europe.
An always-on L4 GPU node for something like Whisper or a 7B model costs around $5,000/month. For a while, that's what we were paying. This article is about how we got that number down.
The starting point
The Whisper service sat on a GKE cluster with a dedicated GPU node, provisioned 24/7 regardless of traffic. Most of the time, nobody was sending voice messages. The diarization service (speaker identification using sherpa-onnx) ran as two always-on replicas on the same cluster, CPU-only.
These workloads are inherently bursty: traffic peaks during business hours, then drops to near zero overnight. We needed the GPU when a user sent a voice note, not the other 23 hours of the day.
The obvious answer was serverless: scale to zero when idle, spin up on demand.
Moving to Cloud Run v2
Both services now run on Google Cloud Run v2. Whisper uses an NVIDIA L4 GPU; diarization is CPU-only but uses the same Terraform module for consistency.
The key configuration:
scaling {
  min_instance_count = 0 # No instances when idle
  max_instance_count = 2 # Limited by GPU quota
}

resources {
  limits = {
    cpu              = "4000m"
    memory           = "16Gi"
    "nvidia.com/gpu" = "1"
  }
}
Both services scale to zero when there's no traffic and spin up on demand. They communicate over gRPC.
We built a single Terraform module (cloud_run_gpu) that handles both GPU and CPU-only deployments. Adding a new service is a ~20-line module block in the environment file.
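For a sense of scale, here is roughly what a new service looks like in the environment file. The input names and values below are illustrative, not the module's actual interface:

# Sketch only: variable names are made up for illustration.
module "whisper" {
  source = "../../modules/cloud_run_gpu"

  service_name  = "whisper-transcription"
  region        = "europe-west1"
  image         = "europe-west1-docker.pkg.dev/<project>/inference/whisper:latest"
  gpu_enabled   = true   # false for the CPU-only diarization service
  cpu           = "4000m"
  memory        = "16Gi"
  min_instances = 0
  max_instances = 2
}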
That part was straightforward. The first real constraint came from the region.
GPU availability in Europe
Our main infrastructure runs in europe-west9 (Paris), but L4 GPUs on Cloud Run weren't available there. We had to deploy GPU services in europe-west1 (Belgium) and bridge the two with a VPC Serverless Connector.
It's not a big deal operationally, but it's the kind of thing you want to check before designing the rest of the architecture. GPU region availability in Europe is still limited compared to the US.
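For reference, the bridge itself is small. A sketch of the shape, with placeholder names and CIDR:

# Placeholder name, network, and CIDR; the connector lives in the GPU region.
resource "google_vpc_access_connector" "gpu_bridge" {
  name          = "gpu-bridge"
  region        = "europe-west1"   # same region as the GPU services
  network       = "main-vpc"       # the VPC shared with europe-west9
  ip_cidr_range = "10.8.0.0/28"    # any unused /28
}

The Cloud Run services then reference it through a vpc_access block in their template, which is how Serverless VPC Access connectors are wired up.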
With the region sorted out, the next question was performance: if the service starts from zero, how long does the user wait?
Cold starts
When a Cloud Run GPU instance starts from zero, it needs to initialize CUDA, load the model into GPU memory (Whisper large-v3 is ~3 GB), and pass health checks. Based on our production logs, this consistently takes about 30 seconds, from "Starting new instance" to startup probe passing. Diarization (CPU-only, ONNX models) is around 10 seconds.
Better than we expected, but getting there required separating startup probes from liveness probes. Our first config had a short initial delay and a low failure threshold, so the liveness probe kept killing the container before the model finished loading. The working configuration:
startup_probe {
  grpc { port = 50051 }
  initial_delay_seconds = 30
  period_seconds        = 10
  failure_threshold     = 30 # Allows up to ~300s for initialization
}

liveness_probe {
  grpc { port = 50051 }
  initial_delay_seconds = 10
  period_seconds        = 5
  failure_threshold     = 3
}
The startup probe allows up to 5 minutes, far more than the ~30 seconds actually needed, but that margin costs nothing and prevents false kills during occasional slower starts. The liveness probe only kicks in after startup succeeds.
Cold starts were the most visible issue, but not the only one. Here's what else came up as we put this into production.
Other things that tripped us up
Health check protocol mismatch. The diarization service exposes both HTTP (8080) and gRPC (50051). The health probe was hitting the gRPC port with an HTTP request. gRPC responds with an HTTP/2 preface, which the HTTP probe interprets as invalid. The service kept getting marked unhealthy and restarting. The fix: declare the HTTP port first (Kubernetes probes default to the first port) and use gRPC health checks for gRPC services.
GPU quota limits. Cloud Run's L4 GPU quota in Europe is capped at 2 instances per region under the non-zonal-redundancy pool. We disable zonal redundancy explicitly in Terraform. Without that flag, the cap was lower. This is enough for our current load but it's a hard ceiling.
gpu_zonal_redundancy_disabled = true
Terraform state drift. Cloud Run revision names and image digests change on every CI/CD deployment (Cloud Build, triggered by GitHub releases). We had to tell Terraform to ignore image changes and to prevent create-before-destroy behavior; otherwise Terraform tries to create a new revision before deleting the old one, which exceeds the GPU quota and fails.
lifecycle {
  create_before_destroy = false
  ignore_changes = [
    template[0].containers[0].image,
  ]
}
Memory for ONNX models. The diarization service loads two models (~250 MB total). We initially deployed with 512 MB in staging and it got OOMKilled. No application-level error, the container just disappeared. Bumped to 1 GB in staging, 4 GB in production. Models need more headroom than their file size suggests.
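In the module this ended up as a per-environment value. A sketch, with var.env standing in for however you switch environments:

# Diarization memory: 512Mi was OOMKilled even though the models total ~250 MB on disk.
resources {
  limits = {
    memory = var.env == "production" ? "4Gi" : "1Gi"
  }
}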
Timeout chain. Whisper has a 1-hour Cloud Run timeout because long audio files can take 20+ minutes to transcribe. But the upstream gRPC client was set to 30 seconds by default. Transcriptions were failing silently. Every hop in the chain needs matching timeouts.
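The Cloud Run side of that chain is a single line in the service template; the client-side deadline lives in application code and had to be raised to match:

# Whisper service: allow up to an hour per request (Cloud Run's default is 300s).
template {
  timeout = "3600s"
}

The matching change upstream is simply a longer gRPC deadline on the transcription call.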
With all of that sorted out, here's what the system looks like today.
How it works end to end
A voice message arrives via WhatsApp. The messaging bot stores the audio in GCS. The classifier calls the diarization service (Cloud Run, CPU) to identify speakers, then calls Whisper (Cloud Run, GPU) to transcribe. The transcript goes to a RAG service for embedding and storage.
If both services are at zero instances, the first message after an idle period takes about 30-40 seconds (GPU cold start dominates). Subsequent messages in the same session hit warm instances and complete in seconds. For async voice message processing, that's a fine tradeoff.
So, back to the original question: what does it cost now?
What it costs now
Remember the $5,000/month for an always-on GPU node. Diarization on GKE with two always-on replicas added another ~$500/month on top, for a service with traffic concentrated during business hours and quiet the rest of the time.
After the move to Cloud Run with scale-to-zero, diarization dropped to ~$20/month. Whisper only bills during actual transcription, which in our case is 2-3 hours per day. We also don't deploy GPU services in staging at all. Staging points to the production endpoints, which halves GPU quota usage and avoids duplicate costs.
|  | Before | After |
|---|---|---|
| Diarization | ~$500/mo (2 GKE replicas, 24/7) | ~$20/mo (Cloud Run, scale-to-zero) |
| Whisper | ~$5,000/mo (always-on L4 GPU node) | Billed per use (~2-3 hrs/day) |
| GPU type | n/a | NVIDIA L4 (europe-west1) |
| Max GPU instances | n/a | 2 (regional quota limit) |
| Cold start (GPU) | None (always on) | ~30s (from production logs) |
| Cold start (CPU) | None (always on) | ~10s (from production logs) |
| Terraform | Separate configs per service | 1 shared module |
What we'd do differently
Not much. We'd check GPU region availability earlier. We assumed our primary region would have L4s and had to rearchitect the networking when it didn't.
We'd also set up the startup/liveness probe separation from day one instead of debugging false-positive health check failures after deployment.
The rest was iterative. The infrastructure evolved over several months through production use, and most of the fixes in this article came from actual incidents.
We're a small team building AI tools for businesses. This is one piece of the infrastructure behind it. If any of this is relevant to what you're working on, feel free to reach out.