Why a Hetzner-native approach for LLM clusters makes sense
When choosing infrastructure for large language models (LLMs), CTOs and lead engineers must balance three competing goals: cost control, data sovereignty and scalability. Hetzner combines these attributes in a way that is attractive for many European companies. Unlike with pure public cloud offerings, you retain physical control over your infrastructure, benefit from transparent pricing, and avoid unwanted data transfers to third parties.
We see Hetzner as a strong option especially where regulatory requirements, long data-retention periods or strict audit obligations apply, which are typical challenges in the automotive and manufacturing sectors. From projects with STIHL, Mercedes-Benz and Eberspächer we know the practical requirements: traceable logs, tenant-specific isolation and the ability to scale infrastructure quickly and cost-effectively.
Architecture overview: From user request to model
A robust, Hetzner-native AI backend consists of several clearly separated components. This combination provides the balance between performance, cost and auditability that we recommend for production systems:
- API Gateway & Auth – Exposes endpoints and handles authentication, rate limiting and audit headers.
- Multi-Model-Router – Dynamically decides which model (local or external) is used for a request.
- Model-Serving-Cluster – GPU- or CPU-based servers that host the models (vLLM, Hugging Face Inference Server, Triton or similar).
- Queueing & Orchestration – Asynchronous processing for long inference runs (Redis Streams, RabbitMQ or Kafka).
- State & Metadata – PostgreSQL for tenant metadata, audit logs and configurations.
- Object Storage – MinIO as an S3-compatible solution for fine-tuning data, embeddings and artifacts.
- Cache & Locks – Redis for fast vector lookups, rate limits and distributed locks.
- Observability & Security – Prometheus, Grafana, centralized logs and SIEM integration.
Visually, the data flow can be sketched in three layers: Ingress/API → Routing/Queueing → Model-Serving + Storage. This strict separation ensures each element can be scaled, audited and secured independently.
Core components in detail: Postgres, MinIO, Redis, queueing and Multi-Model-Router
PostgreSQL: The auditable backbone
Postgres serves not only as a metadata database but also as the central source of truth for configurations, audit trails and tenant isolation. We recommend a highly available Postgres setup (primary/replica) with WAL archiving and regular backups to MinIO. In regulated environments it is not enough to keep logs only temporarily; we therefore configure write-once audit tables and role-based access controls to guarantee traceability.
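To make the write-once property concrete, here is a minimal sketch of an append-only audit table. The table, role and connection details (audit_log, llm_service, the DSN) are illustrative assumptions, not a schema we prescribe:

```python
# A minimal sketch of an append-only audit table in Postgres.
# Table name, role name and DSN are illustrative assumptions.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS audit_log (
    id          BIGSERIAL PRIMARY KEY,
    tenant_id   TEXT        NOT NULL,
    actor       TEXT        NOT NULL,
    action      TEXT        NOT NULL,
    payload     JSONB,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Application roles may only INSERT and SELECT; never UPDATE or DELETE.
REVOKE UPDATE, DELETE ON audit_log FROM PUBLIC;
GRANT  INSERT, SELECT ON audit_log TO llm_service;  -- assumed app role

-- Belt and braces: a trigger that rejects any mutation attempt.
CREATE OR REPLACE FUNCTION reject_mutation() RETURNS trigger AS $$
BEGIN
    RAISE EXCEPTION 'audit_log is append-only';
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS audit_log_immutable ON audit_log;
CREATE TRIGGER audit_log_immutable
    BEFORE UPDATE OR DELETE ON audit_log
    FOR EACH ROW EXECUTE FUNCTION reject_mutation();
"""

with psycopg2.connect("dbname=llm_meta user=admin") as conn:  # assumed DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
```

The trigger is deliberately redundant with the revoked privileges: even a role that later gains broader rights cannot silently rewrite history.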
MinIO: S3-compatible object storage on-premise
MinIO is our standard for object storage in Hetzner deployments. It is performant, scales horizontally and enables the use of familiar S3 toolchains. We store models, checkpoints, fine-tuning data and embeddings in MinIO and use object-based IAM policies to ensure tenant-specific access control.
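As an illustration of tenant-scoped storage, the following sketch provisions a versioned bucket per tenant via the standard S3 API, which MinIO speaks natively. The endpoint, credential environment variables and bucket naming scheme are assumptions:

```python
# A minimal sketch: per-tenant, versioned buckets on MinIO via the S3 API.
# Endpoint, credentials and the naming scheme are illustrative assumptions.
import os

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.internal.example:9000",  # hypothetical endpoint
    aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
    aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
)

def provision_tenant_bucket(tenant_id: str) -> str:
    """Create an isolated bucket for one tenant and enable versioning."""
    bucket = f"tenant-{tenant_id}-artifacts"
    s3.create_bucket(Bucket=bucket)
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    return bucket

def upload_model_artifact(bucket: str, local_path: str, key: str) -> None:
    """Store a model checkpoint or fine-tuning artifact in the tenant's bucket."""
    s3.upload_file(local_path, bucket, key)
```

Enabling versioning early gives you rollback for checkpoints without extra tooling; note that MinIO supports bucket versioning only on erasure-coded deployments.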
Redis & queueing: Real-time and asynchronous workloads
Redis is used for caching, embedding index lookups (e.g. Faiss-like structures), rate limits and distributed locks. For asynchronous inference pipelines we recommend Redis Streams or RabbitMQ as the queueing layer. This approach lets us decouple batch inferences and background processing, which simplifies throughput control and keeps SLOs stable.
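A minimal sketch of this decoupling with redis-py follows; the stream name inference:requests, the group model-workers and the field names are illustrative assumptions:

```python
# A minimal sketch: decoupling inference via Redis Streams with redis-py.
# Stream, group, host and field names are illustrative assumptions.
import json

import redis

r = redis.Redis(host="redis.internal", port=6379)  # hypothetical host

STREAM, GROUP = "inference:requests", "model-workers"

def run_inference(fields: dict) -> None:
    """Stub standing in for the actual model-server call."""
    print("processing", fields)

def enqueue(tenant_id: str, prompt: str) -> str:
    """API side: push a request; the stream ID doubles as a job handle."""
    entry_id = r.xadd(
        STREAM,
        {"tenant": tenant_id, "payload": json.dumps({"prompt": prompt})},
    )
    return entry_id.decode()

def worker_loop() -> None:
    """Model-server side: consume, process, acknowledge."""
    try:
        r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except redis.ResponseError:
        pass  # the consumer group already exists
    while True:
        batches = r.xreadgroup(GROUP, "worker-1", {STREAM: ">"}, count=8, block=5000)
        for _, entries in batches:
            for entry_id, fields in entries:
                run_inference(fields)
                r.xack(STREAM, GROUP, entry_id)  # unacked entries stay pending
```

Consumer groups give you at-least-once delivery: an entry stays in the pending list until a worker acknowledges it, so a crashed worker does not lose jobs.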
Multi-Model-Router: Dynamic routing for cost and performance
The Multi-Model-Router is the core of flexibility: it evaluates request parameters (cost budget, latency requirement, tenant policy) and decides whether the request should go to a large local model, a smaller edge model or an external SaaS API. This way we combine the best of both worlds: low cost for standard requests and highest quality when needed.
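In its simplest form the routing decision is a small, auditable function. The sketch below is illustrative only: the thresholds, model names and policy fields are assumptions, not our production rules:

```python
# A minimal sketch of the routing idea. Thresholds, model names and
# policy fields are assumptions, not a production configuration.
from dataclasses import dataclass

@dataclass
class Request:
    tenant_policy: str       # e.g. "on_prem_only" for regulated tenants
    max_latency_ms: int
    cost_budget_cents: int
    prompt_tokens: int

def route(req: Request) -> str:
    # Hard constraint first: sensitive tenants never leave the cluster.
    if req.tenant_policy == "on_prem_only":
        return "local-large" if req.max_latency_ms >= 800 else "local-small"
    # Cheap, latency-critical traffic goes to the small edge model.
    if req.max_latency_ms < 300 and req.prompt_tokens < 1_000:
        return "local-small"
    # Quality-critical requests with budget may use the external SaaS API.
    if req.cost_budget_cents >= 50:
        return "external-saas"
    return "local-large"
```

Keeping this logic in one pure function makes every routing decision unit-testable and easy to log alongside the audit trail.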
Deployment: How we use Coolify at Reruption
At Reruption we use Coolify for fast, reproducible deployments. Coolify lets us define infrastructure components as templates, manage secrets securely and set up CI/CD pipelines for models and services. On Hetzner, complete stacks including networking, load balancing, firewalls and monitoring can be provisioned in days, not weeks.
Our typical deployment strategy:
- Infrastructure-as-Code for networks and firewalls.
- Coolify apps for API gateway, router, model server and scheduler.
- Helm charts or Docker Compose for consistent environment packages.
- Automatic rollbacks and canary deployments for model rollouts (sketched below).
Coolify not only speeds up provisioning but also reduces operational complexity, a prerequisite when you take responsibility like a co-founder does (our Co‑Preneur approach).
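For the canary piece, a deterministic traffic split is often enough. The sketch below is a minimal illustration; the 5% share and the version names are assumptions:

```python
# A minimal canary split for model rollouts: a deterministic hash sends
# a fixed share of traffic to the new version. Share and names assumed.
import hashlib

CANARY_SHARE = 0.05
STABLE, CANARY = "llm-v1", "llm-v2"

def pick_version(request_id: str) -> str:
    # Hash-based bucketing keeps a given request ID on the same version,
    # which makes A/B comparison and rollback analysis reproducible.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return CANARY if bucket < CANARY_SHARE * 10_000 else STABLE
```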
Example stack: LLMs + APIs + automation nodes
For us, a concrete production stack on Hetzner looks like this:
- Frontend/API Gateway: Nginx/Traefik + OAuth2 / mTLS
- Auth & Tenant Service: Postgres, Keycloak
- Router Service: Configurable Multi-Model-Router (k8s deployment)
- Model Serving: Combination of bare-metal GPU servers (vLLM/Triton) and CPU-based scale-out nodes (llama.cpp)
- Storage: MinIO (expandable), backup -> offsite archive
- Cache & Queue: Redis (cluster) + Redis Streams for workflows
- Automation Nodes: Kubernetes CronJobs or Airflow for fine-tuning, indexing, retraining
- Observability: Prometheus, Grafana, Loki, SIEM integration
Workflow example: A user request hits the API gateway → auth check → router evaluates budget & latency → request placed into Redis Stream (if asynchronous) → model server processes → result + audit log stored in Postgres → artifacts saved to MinIO. Every step is versioned and auditable.
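Condensed into code, the synchronous path of this workflow looks roughly as follows. Every helper is a stub standing in for the real service, and all names and payload fields are assumptions:

```python
# A minimal end-to-end sketch of the workflow above. Each helper is a
# stub for the real service call; names and fields are assumptions.
import uuid

def authenticate(req: dict) -> dict:
    """Gateway stub: attach tenant and a fresh audit ID."""
    return {**req, "tenant": req.get("tenant", "demo"), "audit_id": str(uuid.uuid4())}

def route(ctx: dict) -> str:
    """Router stub: budget/latency evaluation reduced to one flag."""
    return "local-large" if ctx.get("sensitive") else "external-saas"

def infer(model: str, ctx: dict) -> str:
    """Model-server stub."""
    return f"[{model}] answer for {ctx['audit_id']}"

def handle_request(raw: dict) -> dict:
    ctx = authenticate(raw)        # gateway: auth check
    model = route(ctx)             # router: budget, latency, tenant policy
    result = infer(model, ctx)     # model server (synchronous path)
    # In production: append to the Postgres audit_log, store artifacts
    # in the tenant's MinIO bucket, and version both.
    return {"status": "done", "audit_id": ctx["audit_id"], "result": result}
```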
Cost comparison: SaaS vs. self-host (realistic assessment)
A fair cost comparison must consider the total cost of ownership (TCO): direct API fees plus hidden costs such as data-exfiltration risk, compliance effort and personnel costs for operations and engineering.
A rough monthly model for moderate production operations:
- SaaS LLM (with high usage): €5,000–€40,000 (depending on token volume and latency/quality requirements).
- Self-Host Hetzner (including GPU servers, storage, network, monitoring): One-time PoC & setup ~€9,900 (our AI PoC offering) + ongoing €6,000–€20,000/month for moderate production load. With heavy scaling, GPU costs can rise accordingly.
Important: self-hosting can pay off quickly if you have steady load or strict data protection requirements. SaaS is attractive for very sporadic usage or when time-to-market is paramount. In our experience the break-even often lies within a few months to a year, depending on the consumption profile.
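A quick back-of-envelope calculation makes the break-even tangible. The inputs are mid-range values taken from the rough figures above; they are illustrative, not a quote:

```python
# Back-of-envelope break-even, using mid-range values from the ranges
# above. All figures are illustrative, not an offer or a quote.
setup_cost = 9_900          # one-time PoC & setup (EUR)
self_host_monthly = 12_000  # mid-range of 6,000-20,000 EUR/month
saas_monthly = 25_000       # mid-range of 5,000-40,000 EUR/month

monthly_saving = saas_monthly - self_host_monthly
break_even_months = setup_cost / monthly_saving
print(f"Break-even after ~{break_even_months:.1f} months")  # ~0.8 here
```

With these mid-range inputs the setup cost amortizes in under a month; with sporadic usage the saving shrinks or inverts, which is exactly the consumption-profile dependency described above.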
Security, compliance and tenant isolation
In regulated industries three aspects are decisive: data sovereignty, auditability and access control. On Hetzner we can configure physical network separation, VPC layers, dedicated GPU nodes and WORM-capable backups (Write Once Read Many) — all building blocks to meet compliance requirements.
We implement tenant isolation through Kubernetes namespaces, separate MinIO buckets with individual IAM policies, and Postgres schemas per tenant. Mandatory logging hooks in the router pipeline annotate every request with the correct audit context. Security reviews and pen tests are part of every production release.
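One way to make those logging hooks mandatory rather than optional is middleware that runs before any routing logic. Here is a minimal sketch using FastAPI; the x-tenant-id header and the context fields are assumptions:

```python
# A minimal sketch of a mandatory audit hook in the router pipeline,
# as FastAPI middleware. Header and field names are assumptions.
import time
import uuid

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def audit_context(request: Request, call_next):
    # Every request gets an audit ID and tenant annotation before routing.
    request.state.audit = {
        "audit_id": str(uuid.uuid4()),
        "tenant": request.headers.get("x-tenant-id", "unknown"),
        "received_at": time.time(),
    }
    response = await call_next(request)
    response.headers["x-audit-id"] = request.state.audit["audit_id"]
    # In production this record is appended to the Postgres audit_log.
    return response
```

Because the middleware wraps every route, no service behind the gateway can process a request without an audit context attached.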
Lessons learned from regulated projects
From our work with STIHL, Mercedes-Benz and Eberspächer, several recurring insights emerged:
- Start with clear data classes: not all data may be processed the same way. Classifying early saves a lot of effort later.
- Auditability is not an add-on: implement logging and versioning from the start.
- Hybrid operation works best: local models for sensitive data, SaaS for rare special cases.
- Operationalization takes time: plan 20–30% of project time for observability, security and runbook creation.
These lessons reflect our Co‑Preneur philosophy: we don’t just design — we build and operate with you until the stack reliably contributes to the customer’s P&L.
Roadmap: From PoC to stable production
A realistic implementation roadmap in four phases:
- PoC (2–4 weeks): Validate a use case, small model on a dedicated Hetzner VM, first integration with Postgres/MinIO. (Our AI PoC: €9,900.)
- MVP (4–8 weeks): Multi-Model-Router, queueing, basic observability, tenant isolation for pilot customers.
- Production (8–12 weeks): HA setups, backup/DR, scaling, security audit, SLA definitions.
- Optimization & Automation (ongoing): cost optimization, auto-scaling rules, fine-tuning pipelines and retraining automation.
We recommend defining runbooks and incident processes early. In regulated environments, regular compliance checks and penetration tests should also be planned.
Practical examples and recommendations
Concrete recommendations you can implement right away:
- Start with a concrete use case (e.g. recruiting chatbot, document search) and measure token/query volume before designing the architecture.
- Use MinIO from the beginning for model artifacts to avoid migrations later.
- Implement Multi-Model-Routing so you can flexibly switch between local models and external APIs.
- Plan for observability with metrics, traces and logs from day one; otherwise your only option in an incident is guesswork. A minimal metrics sketch follows after this list.
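As a starting point for the observability recommendation, here is a minimal sketch with prometheus_client; the metric names, labels and port are assumptions:

```python
# A minimal observability sketch with prometheus_client. Metric names,
# labels and the scrape port are illustrative assumptions.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "Inference requests", ["model", "tenant"])
LATENCY = Histogram("llm_request_seconds", "End-to-end inference latency", ["model"])

def observed_inference(model: str, tenant: str, fn, *args):
    """Wrap any inference call so it is counted and timed."""
    REQUESTS.labels(model=model, tenant=tenant).inc()
    with LATENCY.labels(model=model).time():
        return fn(*args)

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes /metrics on this port
    while True:
        observed_inference("local-small", "demo", time.sleep, 0.1)  # demo load
```

Counters and histograms like these feed directly into the Grafana dashboards and alert rules mentioned in the stack above.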
Conclusion & Call to Action
Hetzner-native LLM clusters offer a powerful blend of cost control, data sovereignty and scalability. With a clear architecture based on Postgres, MinIO, Redis, queueing and a Multi-Model-Router, companies can achieve the necessary balance between performance and compliance. Coolify helps us roll out such stacks quickly and reproducibly — a decisive advantage for companies that want not just a proof-of-concept but a true production product.
If you’re considering whether self-host on Hetzner fits your LLM strategy, we’re happy to support you: from a focused PoC (€9,900) to full implementation and operation. Contact us so we can design the right architecture for your requirements and embed it in your organization.