What Is AI Inference? A Developer's Guide
AI inference is the process of feeding input to a trained model and getting a prediction back. If training is where a model learns patterns from data, inference is where it puts that learning to work: answering your question, classifying your image, or generating the next token in a sentence. Every time you hit an OpenAI endpoint or run a local LLM, you're doing inference.
NVIDIA announced the Vera Rubin GPU architecture at GTC 2026: 50 PFLOPS per chip, 288GB of HBM4 memory, and roughly 10x lower cost per token than the current Blackwell generation. That hardware shift directly changes the economics of inference for every developer calling an AI API. This post covers what inference actually is, what determines how fast (and expensive) it gets, and how to work with the JSON responses that come back.
How does AI inference actually work?
Inference starts with a forward pass through a neural network. Your input — text, an image, audio — gets converted into numerical tensors and pushed through layers of matrix multiplications. For a transformer model (the architecture behind GPT-4, Claude, Llama, and most modern LLMs), this means attention calculations across your input tokens, followed by feedforward layers, repeated dozens or hundreds of times depending on the model's depth.
Here's a simplified version of what happens when you send a prompt to an API:
- Your text gets tokenized: split into subword chunks the model understands. "Hello world" might become two tokens; a complex code snippet might be hundreds.
- Those tokens become embedding vectors (arrays of floating-point numbers).
- The vectors pass through the transformer's attention layers. Each layer weighs how much every token should "attend to" every other token.
- The final layer outputs a probability distribution over the vocabulary; the model picks the most likely next token.
- For generative models, this repeats token by token (autoregressive decoding) until the model produces a stop token or hits your max_tokens limit.
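The loop in that last step can be sketched in a few lines of Python. This is a toy illustration of greedy autoregressive decoding: fake_logits stands in for a real forward pass, and the five-entry vocabulary is made up purely for the example.

```python
import math

# Toy vocabulary; a real model has tens of thousands of subword tokens.
VOCAB = ["<stop>", "Hello", "world", "!", "AI"]

def fake_logits(tokens):
    # Stand-in for a real forward pass: score every vocabulary entry.
    # This fake model just favors the token after the last one, then <stop>.
    scores = [0.0] * len(VOCAB)
    if tokens and tokens[-1] + 1 < len(VOCAB):
        scores[tokens[-1] + 1] = 5.0
    else:
        scores[0] = 5.0  # nothing left to say: emit <stop>
    return scores

def softmax(logits):
    # Turn raw scores into a probability distribution over the vocabulary.
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_tokens, max_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):            # stop when we hit max_tokens...
        probs = softmax(fake_logits(tokens))
        next_id = probs.index(max(probs))  # greedy: pick most likely token
        if next_id == 0:                   # ...or when <stop> is produced
            break
        tokens.append(next_id)
    return tokens

print([VOCAB[t] for t in generate([1])])  # → ['Hello', 'world', '!', 'AI']
```

Real APIs sample from the distribution (controlled by temperature and top_p) instead of always taking the argmax, but the generate-append-repeat loop is the same.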
The key thing to understand: training adjusts billions of parameters over weeks using massive datasets. Inference uses those fixed parameters to process your specific input. Training is expensive and slow. Inference is (relatively) cheap and fast, but "cheap" is relative when you're serving millions of requests.
What affects inference latency and cost?
Latency depends on a few concrete factors, and understanding them helps you make better API calls.
Model size and precision. A 70-billion-parameter model requires more computation per token than a 7B model. Running in FP16 (16-bit floating point) halves the memory footprint compared to FP32, and INT8 quantization halves it again. Most cloud providers run inference in FP16 or BF16 by default. When NVIDIA talks about 50 PFLOPS on Vera Rubin, that's FP4 performance; the chip can push even more operations per second at lower precision, which is exactly what LLM inference needs.
Batch size. If the GPU processes one request at a time, most of its compute sits idle. Batching groups multiple requests together so the GPU's thousands of CUDA cores stay busy. Higher batch sizes improve throughput (tokens per second across all requests) but can increase latency for individual requests. It's a classic trade-off that inference providers tune constantly.
Context length. Transformer attention scales quadratically with sequence length in the basic implementation. A 4,000-token prompt costs roughly 4x the compute of a 2,000-token prompt for the attention layers (though optimizations like FlashAttention reduce this in practice). This is why API pricing often separates input and output tokens: input tokens are processed in parallel, output tokens are generated sequentially.
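As a back-of-the-envelope check, the quadratic scaling is easy to verify directly (ignoring constant factors like head count and hidden dimension, and the savings from optimized kernels):

```python
def naive_attention_cost(seq_len):
    # Pairwise token interactions scale as seq_len squared
    # in the basic (unoptimized) attention implementation.
    return seq_len ** 2

# Doubling the prompt from 2,000 to 4,000 tokens quadruples attention compute.
print(naive_attention_cost(4000) / naive_attention_cost(2000))  # → 4.0
```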
VRAM. The model's parameters have to fit in GPU memory. A 70B parameter model in FP16 needs about 140GB of VRAM. A single NVIDIA A100 has 80GB; a Blackwell B200 has 192GB; the new Vera Rubin architecture pushes that to 288GB of HBM4. More VRAM means larger models can run on a single chip without splitting across multiple GPUs (tensor parallelism), which reduces inter-chip communication overhead and cuts latency.
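A rough VRAM estimate is just parameter count times bytes per parameter. This sketch counts weights only; a real deployment also needs memory for the KV cache and activations, which this ignores:

```python
def weight_memory_gb(params_billions, bytes_per_param):
    # Weights only: real serving adds KV cache and activation memory on top.
    return params_billions * 1e9 * bytes_per_param / 1e9

print(weight_memory_gb(70, 2))  # 70B in FP16 (2 bytes/param) → 140.0 GB
print(weight_memory_gb(70, 1))  # 70B in INT8 (1 byte/param)  → 70.0 GB
print(weight_memory_gb(7, 2))   # 7B in FP16                  → 14.0 GB
```

The first line is the 140GB figure above: too big for one A100 (80GB), a squeeze on a B200 (192GB), comfortable on Vera Rubin's 288GB.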
Hardware generation matters. Each GPU generation brings architectural improvements beyond raw FLOPS. Vera Rubin's 288GB HBM4 isn't just more memory; it's faster memory bandwidth, which matters because LLM inference is often memory-bandwidth-bound rather than compute-bound. The model weights need to be read from memory for every token generated. Faster memory = faster token generation.
Why do AI APIs return JSON (and how should you handle it)?
JSON is the standard response format for inference APIs because it's structured, language-agnostic, and easy to parse. When you call OpenAI's chat completions endpoint, Anthropic's messages API, or any hosted LLM, you get back a JSON object with the generated text, token counts, and metadata.
A typical response looks something like this:
```json
{
  "id": "chatcmpl-abc123",
  "model": "gpt-4",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "AI inference is the process of..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 156,
    "total_tokens": 198
  }
}
```
That usage object is where cost tracking lives. If you're paying $0.01 per 1,000 input tokens and $0.03 per 1,000 output tokens, you can calculate the exact cost of every request from those numbers. I'd recommend logging them; it adds up faster than you'd expect.
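Turning a usage object into a dollar figure is a one-liner. The default prices here are the illustrative rates from the paragraph above, not any provider's actual rate card:

```python
def request_cost(usage, input_price_per_1k=0.01, output_price_per_1k=0.03):
    # Prices are illustrative; check your provider's current pricing.
    return (usage["prompt_tokens"] / 1000 * input_price_per_1k
            + usage["completion_tokens"] / 1000 * output_price_per_1k)

usage = {"prompt_tokens": 42, "completion_tokens": 156, "total_tokens": 198}
print(f"${request_cost(usage):.6f}")  # → $0.005100
```

Half a cent per request sounds like nothing until you multiply by a million requests a day, which is exactly why logging these numbers matters.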
When you're debugging or inspecting these responses, raw JSON is hard to read in a terminal. The JSON Formatter handles that: paste in a response and you get syntax-highlighted, properly indented output. It's especially useful when you're comparing responses across different models or tracking down a parsing bug where the model returned malformed content inside the content field.
If you're also dealing with authentication tokens for your API calls, you might want to inspect those separately with the JWT Decoder.
What's the difference between inference and training?
Training is the computationally expensive process of adjusting a model's parameters to minimize a loss function across a dataset. It requires massive parallelism across hundreds or thousands of GPUs, takes days to weeks, and consumes megawatt-hours of power. Only the organizations building foundation models (OpenAI, Anthropic, Google, Meta) typically do this.
Inference runs the trained model with fixed weights. It's orders of magnitude cheaper per operation, but it runs constantly: every API call, every chatbot reply, every AI-powered search result. The aggregate compute spent on inference across the industry now exceeds training compute by a wide margin.
This is why NVIDIA's Vera Rubin announcement matters for working developers. You probably aren't training a model from scratch. But you are calling inference APIs, and the cost and latency of those calls directly affect what you can build. A 10x drop in cost per token means you can afford to use larger models, process longer contexts, or just spend less. The hardware improvements trickle down through cloud providers within 12-18 months of a new GPU generation shipping.
There's also a middle ground: fine-tuning. You take a pre-trained model and train it further on your specific data. It's cheaper than training from scratch (you're adjusting parameters, not learning from zero) but more expensive than pure inference. The result is a custom model that you then run inference on.
How can you optimize your inference costs?
Cost optimization comes down to using fewer tokens, using them more efficiently, and picking the right model for each task.
Use shorter prompts. Every token in your prompt costs money and adds latency. Strip unnecessary instructions. Use system prompts efficiently. If you're sending the same context repeatedly, consider fine-tuning instead; bake the knowledge into the model rather than prepending it to every request.
Choose the right model size. Not every task needs GPT-4 or Claude Opus. Classification, extraction, and simple Q&A often work fine with smaller, faster models. Use a large model for complex reasoning; use a small model for everything else. Some teams run a "router" that picks the model based on query complexity.
Cache when possible. If you're making similar requests repeatedly, cache the responses. Exact caches are easy; semantic caches (grouping similar-enough queries) are trickier but can cut costs significantly for applications with repetitive query patterns.
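An exact-match cache can be as simple as hashing the full request and storing the response; a semantic cache would replace the hash lookup with an embedding similarity search. The call_model argument below is a hypothetical stand-in for your real API client:

```python
import hashlib
import json

_cache = {}

def cached_completion(model, prompt, call_model):
    # call_model is a placeholder for your actual API client function.
    key = hashlib.sha256(
        json.dumps({"model": model, "prompt": prompt}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:       # cache miss: pay for inference exactly once
        _cache[key] = call_model(model, prompt)
    return _cache[key]          # cache hit: free and instant

# Usage: identical requests only hit the (fake) API once.
calls = []
def fake_api(model, prompt):
    calls.append(prompt)
    return f"response to {prompt!r}"

cached_completion("small-model", "What is inference?", fake_api)
cached_completion("small-model", "What is inference?", fake_api)
print(len(calls))  # → 1
```

In production you'd also want a TTL and an eviction policy, since model outputs for the same prompt can legitimately change when the provider updates the model.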
Stream responses. Streaming doesn't reduce total latency, but it dramatically improves perceived latency. The user sees tokens appearing within milliseconds instead of waiting seconds for the full response. Most inference APIs support server-sent events (SSE) for streaming.
Watch your token counts. Parse the usage field from every response (use the JSON Formatter to inspect them during development). Track input vs. output tokens separately; they're priced differently with most providers. Set max_tokens to reasonable limits so runaway generations don't blow your budget.
Quick reference
| Term | What it means |
|---|---|
| Inference | Running a trained model on new input to get predictions |
| Token | A subword unit the model processes, roughly ¾ of an English word |
| Latency | Time from request to first (or complete) response |
| Throughput | Total tokens processed per second across all requests |
| Batch size | Number of requests processed simultaneously on a GPU |
| VRAM | GPU memory where model weights live during inference |
| CUDA cores | NVIDIA's parallel processing units that handle matrix math |
| FP16 / BF16 | 16-bit floating point formats used for efficient inference |
| Quantization | Reducing model precision (FP16 → INT8 → INT4) to save memory and speed up inference |
| Forward pass | One complete run of input through the model's layers |
| Autoregressive | Generating output one token at a time, each depending on the previous |
| Vera Rubin | NVIDIA's 2026 GPU architecture: 50 PFLOPS, 288GB HBM4, ~10x cheaper tokens vs Blackwell |
| Transformer | The neural network architecture behind GPT, Claude, Llama, and most modern LLMs |
The economics of inference keep shifting. Vera Rubin GPUs won't ship to cloud providers overnight, but the trajectory is clear: running AI models gets cheaper and faster every generation. If you're building on top of inference APIs, structure your code to swap models easily and track your token usage from day one. Start by pasting a few API responses into the JSON Formatter to get comfortable with the response structure.
Try the tool mentioned in this article:
Open Tool →