Google Gemma 3

Gemma 3 is a versatile, lightweight, multimodal open model family from Google DeepMind. It processes text and images and generates text, supports over 140 languages with a 128K-token context window, and is designed for easy deployment in resource-constrained environments.
Llama 3.2 11B Vision Instruct

Llama 3.2-Vision, developed by Meta, is a state-of-the-art multimodal language model optimized for image recognition, reasoning, and captioning, outperforming many open and closed multimodal models on common industry benchmarks.
Llama 3.1 8B Instruct

Meta Llama 3.1 is a collection of advanced, multilingual large language models optimized for dialogue, available in 8B, 70B, and 405B sizes. The models outperform many open and closed chat models on common industry benchmarks and emphasize safe, responsible use across applications.
Start for free
Begin with $25 in free credits to explore our models via the Playground.
Integrate in minutes
Switch to Inference.net by changing a single line of code (see the example below). Start saving today.
Pay-as-you-go
Only pay for what you use. Set limits and monitor usage via our dashboards.
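The switch really is one line with the OpenAI Node SDK: point the client's baseURL at Inference.net and use your Inference.net API key. A minimal sketch:

import OpenAI from "openai";

// The only line that changes from a stock OpenAI setup is baseURL;
// pair it with your Inference.net API key and the rest of your code stays the same.
const openai = new OpenAI({
  baseURL: "https://api.inference.net/v1", // was "https://api.openai.com/v1"
  apiKey: process.env.INFERENCE_API_KEY,
});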
TEXT-TO-TEXT
Prices shown are per 1 million tokens
Model | Quantization | Input | Output |
---|---|---|---|
Llama 3.1 8B Instruct | FP16 | $0.02 | $0.03 |
Llama 3.2 1B Instruct | FP16 | $0.01 | $0.01 |
Llama 3.2 3B Instruct | FP16 | $0.02 | $0.02 |
Mistral Nemo 12B Instruct | FP8 | $0.038 | $0.10 |
Osmosis Structure 0.6B | FP32 | $0.10 | $0.50 |
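As a worked example of how per-million-token pricing translates into a bill, here is a small TypeScript sketch; the helper function and traffic numbers are illustrative, not part of the API:

// Prices in the table above are USD per 1,000,000 tokens.
function estimateCost(
  inputTokens: number,
  outputTokens: number,
  inputPricePerM: number,
  outputPricePerM: number,
): number {
  return (inputTokens / 1_000_000) * inputPricePerM + (outputTokens / 1_000_000) * outputPricePerM;
}

// Llama 3.1 8B Instruct at $0.02 input / $0.03 output per million tokens:
// 50M input + 10M output tokens => 50 * 0.02 + 10 * 0.03 = $1.30
console.log(estimateCost(50_000_000, 10_000_000, 0.02, 0.03)); // 1.3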
IMAGE-TO-TEXT
Prices shown are per 1 million tokens
Model | Quantization | Input | Output |
---|---|---|---|
Google Gemma 3 | BF16 | $0.15 | $0.30 |
Llama 3.2 11B Vision Instruct | FP16 | $0.055 | $0.055 |
Qwen 2.5 7B Vision Instruct | BF16 | $0.20 | $0.20 |
PLAYGROUND
For each conversation, the Playground reports the total cost, time to first token, tokens per second, and total tokens generated.
Playground controls let you tweak the overall style and tone of the conversation, control how creative the model is when responding, and set the maximum token length of the generated text.
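Those three controls correspond to standard parameters in the OpenAI-compatible API. A minimal sketch, reusing the openai client configured above and assuming a system message for tone, temperature for creativity, and max_tokens for length:

const response = await openai.chat.completions.create({
  model: "deepseek/deepseek-r1-0528/fp-8",
  messages: [
    // Style and tone of the conversation.
    { role: "system", content: "You are a concise, friendly assistant." },
    { role: "user", content: "Explain token streaming in one paragraph." },
  ],
  temperature: 0.7, // how creative the model should be
  max_tokens: 256,  // maximum length of the generated text
});

console.log(response.choices[0].message.content);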
Calculate Your Savings
Price on Together.ai
$180/mo
Input: $0.18 / million tokens | Output: $0.18 / million tokens
Price on Inference.net
$55/mo (save 69%)
Input: $0.055 / million tokens | Output: $0.055 / million tokens
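The 69% figure is simple arithmetic on the per-token prices; a quick sketch with an illustrative monthly volume of one billion tokens:

// $0.18/M tokens on Together.ai vs $0.055/M tokens on Inference.net.
const monthlyTokensInMillions = 1_000; // 1B tokens per month (illustrative)
const togetherCost = monthlyTokensInMillions * 0.18;   // $180/mo
const inferenceCost = monthlyTokensInMillions * 0.055; // $55/mo
const savings = 1 - inferenceCost / togetherCost;      // 0.694...
console.log(`Save ${Math.round(savings * 100)}%`);     // "Save 69%"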
import OpenAI from "openai";

// Point the OpenAI SDK at Inference.net's OpenAI-compatible endpoint.
const openai = new OpenAI({
  baseURL: "https://api.inference.net/v1",
  apiKey: process.env.INFERENCE_API_KEY,
});

// Stream a chat completion and print tokens as they arrive.
const completion = await openai.chat.completions.create({
  model: "deepseek/deepseek-r1-0528/fp-8",
  messages: [
    {
      role: "user",
      content: "What is the meaning of life?",
    },
  ],
  stream: true,
});

for await (const chunk of completion) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
REAL-TIME CHAT
Powerful serverless inference APIs that scale from zero to billions.
Top-Tier Performance
Industry-leading latency and throughput powered by highly optimized GPU infrastructure.
Unbeatable Pricing
Up to 90% cost savings vs legacy providers. Only pay for what you use, and never a penny more.
Easy Integration
First-class support for LangChain, LlamaIndex and other popular LLM frameworks.
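As one sketch of framework integration, here is LangChain's OpenAI chat wrapper pointed at the same endpoint; the package and option names below come from @langchain/openai and may vary by version, and the model slug is just the one used elsewhere on this page:

import { ChatOpenAI } from "@langchain/openai";

// LangChain chat model backed by Inference.net's OpenAI-compatible API.
const chat = new ChatOpenAI({
  model: "deepseek/deepseek-r1-0528/fp-8",
  apiKey: process.env.INFERENCE_API_KEY,
  configuration: { baseURL: "https://api.inference.net/v1" },
});

const result = await chat.invoke("Summarize the benefits of serverless inference in two sentences.");
console.log(result.content);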
BATCH INFERENCE
Process millions of requests per batch with a single API call.
Unmatched Scale & Cost
We handle the largest asynchronous LLM workloads at the lowest prices on the market.
Build Advanced Workflows
Power massive-scale data analysis, synthetic data generation, document processing, and more with our batch API.
Built for Developers
Easy to integrate. Find the code samples and documentation you need, when you need it.
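A minimal sketch of the batch pattern, assuming the Batch API mirrors the OpenAI file-plus-batch shape (one JSONL line per request, then a single call to submit the whole job); check the batch documentation for the exact endpoints and fields:

import OpenAI, { toFile } from "openai";

const client = new OpenAI({
  baseURL: "https://api.inference.net/v1",
  apiKey: process.env.INFERENCE_API_KEY,
});

// Each JSONL line is an independent chat-completion request.
const lines = ["Summarize document A", "Summarize document B"].map((prompt, i) =>
  JSON.stringify({
    custom_id: `req-${i}`,
    method: "POST",
    url: "/v1/chat/completions",
    body: {
      model: "deepseek/deepseek-r1-0528/fp-8",
      messages: [{ role: "user", content: prompt }],
    },
  })
);

// Upload the request file, then submit the whole batch with a single API call.
const file = await client.files.create({
  file: await toFile(Buffer.from(lines.join("\n")), "batch.jsonl"),
  purpose: "batch",
});

const batch = await client.batches.create({
  input_file_id: file.id,
  endpoint: "/v1/chat/completions",
  completion_window: "24h",
});

console.log(batch.id, batch.status);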
DATA EXTRACTION
Transform unstructured data into actionable insights with powerful schema validation and parsing.
Precise Extraction
Extract structured data with guaranteed schema compliance using JSON Schema validation. Handle complex nested objects with confidence.
Flexible Processing
Process data at scale with our Batch API, or stream response objects in real-time as they are generated.
Familiar Tooling
First-class SDK support for TypeScript, Python, and more. Support for popular validation tools like Pydantic and Zod.
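A sketch of schema-validated extraction in TypeScript with Zod, reusing the openai client from the earlier example; whether a given model route honors a response_format constraint is an assumption to confirm against the docs, and the invoice example is purely illustrative:

import { z } from "zod";

// The structured shape we want back, including a nested array of line items.
const Invoice = z.object({
  vendor: z.string(),
  total: z.number(),
  lineItems: z.array(z.object({ description: z.string(), amount: z.number() })),
});

const rawInvoiceText = "ACME Corp invoice: 2 widgets at $5 each, total $10.";

const res = await openai.chat.completions.create({
  model: "deepseek/deepseek-r1-0528/fp-8",
  messages: [
    { role: "system", content: "Extract the invoice as JSON with keys vendor, total, and lineItems." },
    { role: "user", content: rawInvoiceText },
  ],
  response_format: { type: "json_object" }, // ask the model to emit JSON
});

// Parse and validate; this throws if the output does not match the schema.
const invoice = Invoice.parse(JSON.parse(res.choices[0].message.content ?? "{}"));
console.log(invoice.vendor, invoice.total);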
2 MINUTES TO INTEGRATE
We designed our API from scratch to make integration as easy as possible; a full integration takes only two minutes. Switch today. Satisfaction guaranteed.
END-TO-END GENERATIVE AI
We power the most comprehensive generative AI workflows for your application without missing a beat. Get tokens to your users at blazing-fast speeds.
AFFORDABLE AT EVERY SCALE
We built custom AI-native orchestration and scheduling software to ensure you always get the best prices without compromising on performance.
TO INFINITY AND BEYOND
We regularly update our model catalog when new models are released, so you always have access to the latest and greatest AI models.