The question now arises in almost every conversation with our customers: “We would like to use AI seriously – but our data must not leave the house. Is that even possible?” The short answer is yes. And in 2026 it will be much easier to give than two years ago.
The reason is a double development. On the one hand, open language models have become so good that they are barely behind the big cloud services for most office tasks. On the other hand, there is now enough computing power in a single server that a small team can work productively with it. Local AI is no longer a research project, but a tangible option for every medium-sized company that wants to keep its data under control.
Table of Contents

We have taken this path ourselves – for our own team of five and for a number of customer projects. In this article, we share what has proven successful: in terms of hardware, software stack and, above all, the question of what is worth the effort.
What local AI actually means
Local AI (also known as on-premise AI or self-hosted AI) refers to AI models that run on the company’s own hardware – instead of via the cloud of an external provider. All entries and documents remain in the company’s own network. There is no token billing per request and no connection to a third-party API.
The practical difference can be seen in an example. If you send a draft contract to a cloud service, this text leaves your premises, is processed on third-party servers and is subject to the terms and conditions of the provider. If you have the same design summarized by a model on your own server, none of this happens. That’s exactly the core: data sovereignty. For industries with sensitive content – law, human resources, accounting, research – this is often not a nice-to-have, but the condition for being allowed to use AI at all.
In addition, there are two sober advantages that are rarely talked about: The costs can be planned because a request is not billed separately, and the system works even if the Internet connection fails or a provider changes its pricing model overnight.
Hardware: What’s in the closet
When it comes to hardware, many people make the same mistake: they look at the CPU first. In truth, almost everything in AI is decided by the graphics card – and specifically by the video memory (VRAM). Put simply, the larger the model you want to charge, the more VRAM you need. The rest of the computer is an accessory.
The good news: You don’t have to start in the luxury class. A current consumer card with 24 to 32 GB of VRAM, such as a GeForce RTX 4090 or RTX 5090, is already enough to run powerful models smoothly up to around 30 billion parameters. This covers an astonishing amount – summaries, drafts, research, code help.
If you want more – larger models, longer documents, several employees at the same time – you end up with a professional card. The current measure of things on the desk is the NVIDIA RTX PRO 6000 Blackwell with 96 GB of VRAM. This one card runs a 70 billion parameter model with comfortable scope for multiple parallel queries. The catch: Depending on the daily price, it costs roughly between 8,000 and 9,200 euros. To do this, it replaces a whole range of cloud subscriptions – every year anew.
| Scenario | GPU Example | VRAM | Suitable model size | Rough investment |
|---|---|---|---|---|
| First tests, 1 person | RTX 4090 | 24 GB | up to ~14B | from ~2.000 € |
| Small team, everyday life | RTX 5090 | 32 GB | up to ~30B | from ~3.000 € |
| Multiple users, large models | RTX PRO 6000 Blackwell | 96 GB | up to ~70B | ~€8,000–€9,200 |
All around: Generous RAM (at least 64, better 128 GB), a fast NVMe SSD for the model files and a reasonable power supply. And yes – such a card draws power noticeably under full load (the professional version up to 600 watts) and needs cooling. In a normal office cabinet in the server room, however, this is easily manageable.
Rule of thumb: It’s not the largest possible model that counts, but the model that responds quickly enough on your card.
Software: The stack that connects everything
The hardware is only half the battle. Only the software turns the graphics card into a useful assistant. Fortunately, this stack can now be assembled entirely from open-source building blocks – without license fees and without vendor lock-in.
The model itself
Here, the situation has turned rapidly. Three families in particular are of interest to German-speaking companies: the Qwen models (very strong German, over 100 languages, consistently under the permissive Apache 2.0 license), Mistral from France (European, lean, good at following instructions) and Google’s Gemma (strong on a single card – but the license is worth a second look before commercial use). Which version is currently ahead changes practically monthly; the families themselves are the more reliable choice.
An important trick is called quantization: In this process, a model is compressed in such a way that it can get by with less memory without losing any significant quality. A level like “Q4” is a good starting point in practice – it roughly halves the memory requirement, the difference in answers is usually hardly noticeable for office tasks.
The tools around it
All you need to do is try out Ollama or LM Studio – the first model will run in minutes. For productive multi-user operation, the company relies on a real inference engine such as vLLM, which efficiently serves several requests in parallel. OpenWebUI has established itself as an interface that feels like a familiar chat.
However, the real added value is usually only created through RAG (Retrieval-Augmented Generation). Simplified: The model gets access to your own documents, searches for the appropriate passages in them and responds on this basis – with reference to the source. This requires an embedding model (such as bge-m3 for multilingual content) and a vector database such as Qdrant. If you also want to process speech, you can add Whisper for transcription. That sounds like a lot – but it’s a tried-and-tested, documented construction kit, not a self-made one.
Use cases from everyday work
Technology is one thing, the concrete benefit is another. In our experience, the following bets pay off the fastest:
- Knowledge assistant about your own documentation. Manuals, guidelines, project reports, old offers – searchable by RAG, with answers in full sentences instead of a hit list. Saves exactly the searching that no one likes to do.
- Process documents and emails. Classify incoming mail, summarize long attachments, prepare draft responses. Especially in the case of confidential correspondence, local operations are often the decisive argument here.
- Connection to ERP and specialist systems. It gets really exciting when the AI does not respond in a vacuum, but accesses real business data – for example, via OData interface to the SAP system, with clean rights transfer (principal propagation), so that everyone only sees what they are allowed to see. This is our core competence, and this is exactly where the greatest leverage lies.
- Support in development. Explain code, design tests, take care of routine work – completely offline, without a single line of source code leaving the house.
- Sensitive subject areas. Human resources, law, accounting: wherever personal or business-critical data is involved, local AI is often the only way to use AI in compliance with the rules at all.
Cloud or on-premises – the honest trade-off
We don’t sell you local AI as a panacea. There are good reasons to use the cloud: it’s ready to go, scales limitlessly, and always delivers the latest top-of-the-line models. If you only work occasionally and with non-critical content, it is often cheaper.
Switching to local is worthwhile if at least one of these points applies: Your data is sensitive and must not leave the house. They use AI regularly, so ongoing token costs are significant. Or you simply don’t want to be dependent on the prices and conditions of a single provider. For a team of about five to ten regular users, experience has shown that the bill tips in favor of their own hardware – a one-time investment instead of a permanent subscription fee.
In practice, by the way, the answer is rarely a strict either-or. Many companies take a two-pronged approach: the sensitive remains local, the non-critical is allowed to go to the cloud. A routing module such as LiteLLM distributes the requests automatically – the user does not notice anything.
Frequently asked questions about local AI
What is local AI?
On-premise AI refers to AI models that run on the company’s own hardware instead of via a provider’s cloud. All data remains in-house, there are no token costs and there is no connection to external APIs.
What hardware do you need for local AI?
The decisive factor is the GPU or its video memory (VRAM). A card with 24 to 32 GB is sufficient for models up to around 30 billion parameters. For 70B models and multiple parallel users, a professional card like the NVIDIA RTX PRO 6000 Blackwell with 96GB VRAM is suitable.
Is local AI GDPR compliant?
Local AI greatly simplifies data protection because personal data does not leave one’s own network. Commissioned processing by third parties and third-country transfers are no longer necessary. The GDPR obligations for internal processing remain in place, but are much easier to fulfill.
Which open source models are suitable for German-speaking companies?
Multilingual open-weight models such as the Qwen family (strong German, Apache 2.0 license), Mistral from France and Google Gemma are well suited. In practice, it is not so much the model size that counts as the combination of the right size, clean data preparation and good retrieval quality.
Is local AI worth it for small businesses?
Often from five to ten regular users. A one-time hardware investment in the low five-digit range replaces ongoing cloud fees, protects sensitive data and makes the usage volume plannable – without dependence on an external provider.


