Products / InferOS

InferOS

A self-hosted, production-grade LLM inference engine — built for concurrency, engineered to get the most out of every GPU. Runs anywhere from a laptop to a GPU fleet, fully offline, for everyone from solo builders to global enterprises.

Waitlist opens soonDrop-in OpenAI APIRuns fully offlineCPU · Single GPU · Multi-GPU

The first wave of launch users get one month of Pro, free — in exchange for their feedback as we shape the product.

Talk to us

Peak throughput

1,558tok/s

Llama 3.2 1B (4-bit) · 1× NVIDIA T4 · 32 concurrent

Single-user speed

139.8tok/s

Llama 3.2 1B (4-bit) · 1× NVIDIA T4 · 1 request

Reliability

0errors

across 1,408 served requests

Coverage

1,000+models

10+ architecture families tested, plus your fine-tunes

01 / What it is

Your own AI, on your own terms.

InferOS is a self-hosted inference engine that serves popular open models faster, on less hardware, than teams expect — so you get more concurrent users per GPU and predictable latency on the cards you already run, instead of renting scarce top-tier accelerators.

It speaks the OpenAI API, so your existing apps and SDKs point at it unchanged — and it ships the multi-tenant isolation, encryption, and audit trail that regulated teams need but general-purpose tools leave to you. Nothing leaves your infrastructure.

02 / Performance

Built for concurrency and production.

On a single NVIDIA T4, throughput scales with load instead of flatlining — past 1,500 tokens per second on Llama 3.2 1B, and several hundred on larger 3B-class models, on one card. Continuous batching keeps it climbing: even at 64 simultaneous requests it still serves around 1,465 tokens per second. Full test conditions below.

GPU

1× NVIDIA T4 (16 GB)

Precision

4-bit (Q4_K_M)

Output

128 tokens

Temperature

0.7

Environment

Closed / isolated

Llama 3.2 1B · tokens/sec by concurrent requests (C)

C=1

140

tok/s

C=4

377

tok/s

C=8

671

tok/s

C=16

1,107

tok/s

C=32

1,558

tok/s

C=64

1,465

tok/s

At 16 concurrent requests · tokens/sec by model

Llama 3.2 · 1B

1,107

tok/s

Qwen 2.5 · 3B

456

tok/s

Llama 3.2 · 3B

454

tok/s

Phi-3.5 · mini

329

tok/s

Production-shaped traffic

860 tok/s

Staggered, real-world arrivals at a 1.64 s median latency — same Llama 3.2 1B on the same T4.

Repetitive & extractive text

up to 2.5× faster

Versus standard decoding on the same model — with byte-identical output. The speed-up is lossless.

All performance figures were measured in a controlled, closed environment on a single NVIDIA T4 (16 GB), 4-bit weights, 128-token generations at temperature 0.7 — the current generation of our published results. The concurrency curve uses Llama 3.2 1B, each cell taken at the best batching configuration for that concurrency; the by-model figures are each model's throughput at 16 concurrent requests. Throughput varies with hardware, model, quantisation, and load; per-stream rates fall as concurrency rises, the same physics every engine faces on a given card. These are our own measurements, not a benchmark of any third-party product.

03 / Runs anywhere you do

From a laptop to a GPU fleet.

The same engine scales with you — start on a CPU for development, serve production on one card, and grow to many when you need to. No rewrite in between.

Develop on a laptop

A CPU-only mode lets engineers build and test against the real engine with no GPU at all. Spin it up anywhere.

Ship on one modest GPU

Production-grade serving on a single mainstream GPU — the kind most teams already own, not scarce top-tier silicon.

Scale across many GPUs

Grow from one card to a multi-GPU server as demand rises, without changing a line of how your apps talk to it.

Run it your way

Ships as a container with a Kubernetes Helm chart. Deploy on-prem, in your private cloud, or fully air-gapped.

04 / Safety & sovereignty

We value safety and sovereignty.

Self-hosted means more than convenient — it means control. Your models, your data, your infrastructure, with isolation and compliance posture built in rather than bolted on.

Switch off the internet

InferOS runs entirely on your own hardware and never phones home. Disconnect it from the network completely and it keeps serving.

Your data never leaves

Prompts and responses stay inside your walls — nothing is sent to us or any third party. Sovereignty by default, not by add-on.

True multi-tenancy

Every tenant gets its own keys, rate limits, usage metrics, and separate audit logs — enforced end-to-end, never shared by accident.

Compliance-ready by design

Engineered and tested to support the controls regulated teams depend on — encryption in transit and at rest, audit trails, and configurable retention. We don’t yet hold formal certifications, and we won’t claim badges we haven’t earned.

05 / Observability

You can see what every tenant is doing.

Time-to-first-token and per-token latency at p50 / p95 / p99
Goodput — the share of requests that actually meet your latency targets
Per-tenant usage, cache efficiency, and live GPU health
A metrics endpoint with a ready-to-run dashboard out of the box

06 / Model coverage

Thousands of open models. Plus yours.

Llama 3.1 / 3.2Qwen 2.5 / Qwen 3MistralPhi-3.5Gemma 2 / 3Starcoder2Command-RYour own fine-tunes

Thousands of open models run today, across 10+ architecture families we've tested. Bring your own fine-tuned model — if it's a supported architecture, it just works. No lock-in to any one vendor's models, and more architectures are in active validation.

07 / Community

Build it with us.

A community space is launching alongside InferOS — a place to compare notes, raise issues, request models, and help shape the roadmap. The earliest users have the loudest voice.

Talk to us →

Launch

Waitlist opens soon

Be first in line.

The waitlist opens soon, and it's for the first launch users — who get one month of Pro, free, in exchange for their feedback as we shape InferOS. Check back shortly.