SelfHostLLM

Calculate the GPU memory you need for LLM inference

Overview

What it is

Calculate GPU memory requirements and the maximum number of concurrent requests for self-hosted LLM inference. Supports Llama, Qwen, DeepSeek, Mistral, and more. Plan your AI infrastructure efficiently.

Intent

I need it when

Understand memory requirements for Mixture-of-Experts (MoE) models versus standard models

SelfHostLLM distinguishes total parameters from active parameters for MoE models (Mixtral, DeepSeek V3, Qwen3 MoE, GLM-4.7). Its estimates size GPU memory by the active experts rather than the full parameter count, which can enable more efficient deployments.
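The gap between total and active parameters can be illustrated with a minimal sketch. It is not SelfHostLLM's formula: the function name `weight_memory_gb` is hypothetical, the footprint is simply parameters times bytes per parameter, and the Mixtral 8x7B counts (about 46.7B total, 12.9B active per token) are the figures published by Mistral AI. Whether only the active experts must stay resident depends on the serving stack, so the sketch just reports both numbers.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes); 2 bytes per parameter means FP16."""
    return params_billions * bytes_per_param


# Illustrative parameter counts for Mixtral 8x7B (assumption: vendor-published figures).
TOTAL_PARAMS_B = 46.7   # all experts
ACTIVE_PARAMS_B = 12.9  # parameters used per token

total_gb = weight_memory_gb(TOTAL_PARAMS_B)
active_gb = weight_memory_gb(ACTIVE_PARAMS_B)

print(f"all experts, FP16:    {total_gb:.1f} GB")
print(f"active experts, FP16: {active_gb:.1f} GB")
```

The two figures differ by more than 3x here, which is why a planner has to state which parameter count it uses.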

Plan GPU infrastructure requirements for self-hosted LLM deployment

SelfHostLLM estimates the GPU memory needed, the maximum number of concurrent requests, and expected performance from model size, quantization, context length, and hardware configuration. Users enter their GPU setup and model choice to check whether their infrastructure can handle their workload.
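As a rough sketch of this kind of capacity estimate (an assumption about the general approach, not the tool's actual formula), the memory left after weights and runtime overhead can be divided by the KV cache one request needs. All names below (`ModelShape`, `kv_cache_bytes_per_token`, `max_concurrent_requests`) are hypothetical, and the layer and head counts are illustrative values in the range of an 8B-class model with grouped-query attention.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    layers: int     # number of transformer layers
    kv_heads: int   # number of key/value heads
    head_dim: int   # dimension of each head


def kv_cache_bytes_per_token(shape: ModelShape, bytes_per_value: float = 2.0) -> float:
    # Factor 2 covers keys and values; 2 bytes per value assumes an FP16 cache.
    return 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_value


def max_concurrent_requests(total_vram_gb: float, weights_gb: float,
                            overhead_gb: float, context_tokens: int,
                            shape: ModelShape) -> int:
    """Requests that fit if each one reserves a full context window of KV cache."""
    free_bytes = (total_vram_gb - weights_gb - overhead_gb) * 1e9
    if free_bytes <= 0:
        return 0  # the weights alone do not fit
    per_request = kv_cache_bytes_per_token(shape) * context_tokens
    return int(free_bytes // per_request)


# Illustrative setup: one 24 GB GPU, 16 GB of FP16 weights, 2 GB runtime overhead.
shape = ModelShape(layers=32, kv_heads=8, head_dim=128)
n = max_concurrent_requests(total_vram_gb=24, weights_gb=16, overhead_gb=2,
                            context_tokens=8192, shape=shape)
print(f"max concurrent requests at 8192 tokens: {n}")
```

This is a worst-case reservation: servers that allocate KV cache on demand can often admit more requests than this bound suggests.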

Evaluate feasibility of running specific LLM models on available hardware

Users pick one of 100+ pre-configured models (Llama, Qwen, DeepSeek, Mistral, etc.) and their GPU hardware to see instantly whether the model fits, how many concurrent requests are possible, and which context lengths are achievable with their setup.

Estimate token generation speed for different LLM models and hardware combinations

The tool provides performance estimates in tokens/sec that account for memory bandwidth, model efficiency, quantization boost, context length impact, and multi-GPU scaling. Users can compare GPU models and configurations to find the best cost-performance balance.
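The memory-bandwidth part of such an estimate can be sketched as follows. This is a common back-of-the-envelope bound, not the tool's rating method: single-stream decoding has to read the weights once per generated token, so bandwidth divided by weight size caps the token rate. The function name and the `efficiency` factor are hypothetical, the hardware numbers are made up for illustration, and context length and multi-GPU effects are left out.

```python
def decode_tokens_per_sec(bandwidth_gb_s: float, weights_read_gb: float,
                          efficiency: float = 0.6) -> float:
    """Bandwidth-bound estimate of single-stream decode speed.

    bandwidth_gb_s:  peak GPU memory bandwidth in GB/s
    weights_read_gb: weights read per generated token in GB
    efficiency:      assumed fraction of peak bandwidth actually achieved
    """
    if weights_read_gb <= 0:
        raise ValueError("weights_read_gb must be positive")
    return bandwidth_gb_s * efficiency / weights_read_gb


# Hypothetical GPU with 1000 GB/s of bandwidth, achieving half of peak.
fp16_speed = decode_tokens_per_sec(1000, weights_read_gb=16, efficiency=0.5)
int4_speed = decode_tokens_per_sec(1000, weights_read_gb=4, efficiency=0.5)

print(f"16 GB of weights: {fp16_speed:.2f} tokens/sec")
print(f" 4 GB of weights: {int4_speed:.2f} tokens/sec")
```

Under this model, shrinking the weights by 4x raises the bound by 4x, which is one way to read the "quantization boost" mentioned above.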

Optimize quantization strategy to balance model quality and inference speed

SelfHostLLM shows how different quantization levels (FP16, INT8, INT4, MXFP4) affect both memory requirements and token generation speed, helping users make informed trade-offs between model accuracy and performance.
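The memory side of this trade-off can be approximated from bits per parameter. The sketch below is an assumption-laden illustration, not the tool's calculation: `BITS_PER_PARAM` and `quantized_weights_gb` are hypothetical names, the INT8 and INT4 figures ignore per-group scales and zero-points, and the MXFP4 figure assumes 4-bit elements plus one shared 8-bit scale per block of 32 values.

```python
# Approximate storage cost per parameter, in bits (assumptions noted per entry).
BITS_PER_PARAM = {
    "FP16": 16.0,
    "INT8": 8.0,    # ignores scale and zero-point overhead
    "INT4": 4.0,    # ignores scale and zero-point overhead
    "MXFP4": 4.25,  # 4-bit elements + one 8-bit scale per 32-element block
}


def quantized_weights_gb(params_billions: float, fmt: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes) for a given format."""
    return params_billions * BITS_PER_PARAM[fmt] / 8


# Illustrative 8B-parameter model.
for fmt in BITS_PER_PARAM:
    print(f"{fmt:>5}: {quantized_weights_gb(8, fmt):.2f} GB")
```

These numbers cover weights only; the KV cache and runtime overhead come on top, and the accuracy cost of each format has to be measured separately.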

Drop

Not a fit when

  • User needs a managed LLM hosting service rather than infrastructure planning calculations
  • User lacks technical knowledge to interpret GPU memory, KV cache, and quantization concepts
  • User requires actual LLM deployment or inference execution, not capacity planning
  • User needs support for non-GPU inference platforms like CPUs or specialized NPU hardware
  • User seeks real-time performance benchmarking rather than theoretical estimates

Commercials

Pricing

Free tool