Google Gemma 3

Gemma 3 is a versatile, lightweight, multimodal open model family from Google DeepMind. It processes text and images and generates text, supports over 140 languages with a 128K-token context window, and is designed for easy deployment in resource-constrained environments.
Llama 3.2 11B Vision Instruct

Llama 3.2-Vision, developed by Meta, is a state-of-the-art multimodal language model optimized for image recognition, reasoning, and captioning, outperforming many open and closed multimodal models on common industry benchmarks.
Llama 3.1 8B Instruct

Meta Llama 3.1 is a collection of advanced, multilingual large language models optimized for dialogue, available in 8B, 70B, and 405B sizes. The models outperform many open and closed chat models on common industry benchmarks and emphasize safe, responsible use across applications.
Start for free
Begin with $25 in free credits to explore our models via the Playground.
Integrate in minutes
Switch to Inference.net by changing a single line of code (see the example below). Start saving today.
Pay-as-you-go
Only pay for what you use. Set limits and monitor usage via our dashboards.
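The switch really is one line with the OpenAI Node SDK: point the client's baseURL at Inference.net and use your Inference.net API key. A minimal sketch:

import OpenAI from "openai";

// The only line that changes from a stock OpenAI setup is baseURL;
// pair it with your Inference.net API key and the rest of your code stays the same.
const openai = new OpenAI({
  baseURL: "https://api.inference.net/v1", // was "https://api.openai.com/v1"
  apiKey: process.env.INFERENCE_API_KEY,
});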
TEXT-TO-TEXT
Prices shown are per 1 million tokens
Model | Quantization | Input | Output |
---|---|---|---|
Llama 3.1 8B Instruct | FP16 | $0.02 | $0.03 |
Llama 3.2 1B Instruct | FP16 | $0.01 | $0.01 |
Llama 3.2 3B Instruct | FP16 | $0.02 | $0.02 |
Mistral Nemo 12B Instruct | FP8 | $0.038 | $0.10 |
Osmosis Structure 0.6B | FP32 | $0.10 | $0.50 |
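As a worked example of how per-million-token pricing translates into a bill, here is a small TypeScript sketch; the helper function and traffic numbers are illustrative, not part of the API:

// Prices in the table above are USD per 1,000,000 tokens.
function estimateCost(
  inputTokens: number,
  outputTokens: number,
  inputPricePerM: number,
  outputPricePerM: number,
): number {
  return (inputTokens / 1_000_000) * inputPricePerM + (outputTokens / 1_000_000) * outputPricePerM;
}

// Llama 3.1 8B Instruct at $0.02 input / $0.03 output per million tokens:
// 50M input + 10M output tokens => 50 * 0.02 + 10 * 0.03 = $1.30
console.log(estimateCost(50_000_000, 10_000_000, 0.02, 0.03)); // 1.3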
IMAGE-TO-TEXT
Prices shown are per 1 million tokens
Model | Quantization | Input | Output |
---|---|---|---|
Google Gemma 3 | BF16 | $0.15 | $0.30 |
Llama 3.2 11B Vision Instruct | FP16 | $0.055 | $0.055 |
Qwen 2.5 7B Vision Instruct | BF16 | $0.20 | $0.20 |
PLAYGROUND
For each conversation, the Playground reports the total cost, time to first token, tokens per second, and total tokens generated.
Playground controls let you tweak the overall style and tone of the conversation, control how creative the model is when responding, and set the maximum token length of the generated text.
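Those three controls correspond to standard parameters in the OpenAI-compatible API. A minimal sketch, reusing the openai client configured above and assuming a system message for tone, temperature for creativity, and max_tokens for length:

const response = await openai.chat.completions.create({
  model: "deepseek/deepseek-r1-0528/fp-8",
  messages: [
    // Style and tone of the conversation.
    { role: "system", content: "You are a concise, friendly assistant." },
    { role: "user", content: "Explain token streaming in one paragraph." },
  ],
  temperature: 0.7, // how creative the model should be
  max_tokens: 256,  // maximum length of the generated text
});

console.log(response.choices[0].message.content);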
Calculate Your Savings
Price on Together.ai
$180/mo
Input: $0.18 / million tokens | Output: $0.18 / million tokens
Price on Inference.net
$55/mo (save 69%)
Input: $0.055 / million tokens | Output: $0.055 / million tokens
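The 69% figure is simple arithmetic on the per-token prices; a quick sketch with an illustrative monthly volume of one billion tokens:

// $0.18/M tokens on Together.ai vs $0.055/M tokens on Inference.net.
const monthlyTokensInMillions = 1_000; // 1B tokens per month (illustrative)
const togetherCost = monthlyTokensInMillions * 0.18;   // $180/mo
const inferenceCost = monthlyTokensInMillions * 0.055; // $55/mo
const savings = 1 - inferenceCost / togetherCost;      // 0.694...
console.log(`Save ${Math.round(savings * 100)}%`);     // "Save 69%"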
import OpenAI from "openai";

// Point the OpenAI SDK at Inference.net's OpenAI-compatible endpoint.
const openai = new OpenAI({
  baseURL: "https://api.inference.net/v1",
  apiKey: process.env.INFERENCE_API_KEY,
});

// Stream a chat completion and print tokens as they arrive.
const completion = await openai.chat.completions.create({
  model: "deepseek/deepseek-r1-0528/fp-8",
  messages: [
    {
      role: "user",
      content: "What is the meaning of life?",
    },
  ],
  stream: true,
});

for await (const chunk of completion) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
REAL-TIME CHAT
Powerful serverless inference APIs that scale from zero to billions.
Top-Tier Performance
Industry-leading latency and throughput powered by highly optimized GPU infrastructure.
Unbeatable Pricing
Up to 90% cost savings vs legacy providers. Only pay for what you use, and never a penny more.
Easy Integration
First-class support for LangChain, LlamaIndex and other popular LLM frameworks.
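As one sketch of framework integration, here is LangChain's OpenAI chat wrapper pointed at the same endpoint; the package and option names below come from @langchain/openai and may vary by version, and the model slug is just the one used elsewhere on this page:

import { ChatOpenAI } from "@langchain/openai";

// LangChain chat model backed by Inference.net's OpenAI-compatible API.
const chat = new ChatOpenAI({
  model: "deepseek/deepseek-r1-0528/fp-8",
  apiKey: process.env.INFERENCE_API_KEY,
  configuration: { baseURL: "https://api.inference.net/v1" },
});

const result = await chat.invoke("Summarize the benefits of serverless inference in two sentences.");
console.log(result.content);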
BATCH INFERENCE
Process millions of requests per batch with a single API call.
Unmatched Scale & Cost
We handle the largest asynchronous LLM workloads at the lowest prices on the market.
Build Advanced Workflows
Power massive-scale data analysis, synthetic data generation, document processing, and more with our batch API.
Built for Developers
Easy to integrate. Find the code samples and documentation you need, when you need it.
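A minimal sketch of the batch pattern, assuming the Batch API mirrors the OpenAI file-plus-batch shape (one JSONL line per request, then a single call to submit the whole job); check the batch documentation for the exact endpoints and fields:

import OpenAI, { toFile } from "openai";

const client = new OpenAI({
  baseURL: "https://api.inference.net/v1",
  apiKey: process.env.INFERENCE_API_KEY,
});

// Each JSONL line is an independent chat-completion request.
const lines = ["Summarize document A", "Summarize document B"].map((prompt, i) =>
  JSON.stringify({
    custom_id: `req-${i}`,
    method: "POST",
    url: "/v1/chat/completions",
    body: {
      model: "deepseek/deepseek-r1-0528/fp-8",
      messages: [{ role: "user", content: prompt }],
    },
  })
);

// Upload the request file, then submit the whole batch with a single API call.
const file = await client.files.create({
  file: await toFile(Buffer.from(lines.join("\n")), "batch.jsonl"),
  purpose: "batch",
});

const batch = await client.batches.create({
  input_file_id: file.id,
  endpoint: "/v1/chat/completions",
  completion_window: "24h",
});

console.log(batch.id, batch.status);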
DATA EXTRACTION
Transform unstructured data into actionable insights with powerful schema validation and parsing.
Precise Extraction
Extract structured data with guaranteed schema compliance using JSON Schema validation. Handle complex nested objects with confidence.
Flexible Processing
Process data at scale with our Batch API, or stream response objects in real-time as they are generated.
Familiar Tooling
First-class SDK support for TypeScript, Python, and more. Support for popular validation tools like Pydantic and Zod.
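A sketch of schema-validated extraction in TypeScript with Zod, reusing the openai client from the earlier example; whether a given model route honors a response_format constraint is an assumption to confirm against the docs, and the invoice example is purely illustrative:

import { z } from "zod";

// The structured shape we want back, including a nested array of line items.
const Invoice = z.object({
  vendor: z.string(),
  total: z.number(),
  lineItems: z.array(z.object({ description: z.string(), amount: z.number() })),
});

const rawInvoiceText = "ACME Corp invoice: 2 widgets at $5 each, total $10.";

const res = await openai.chat.completions.create({
  model: "deepseek/deepseek-r1-0528/fp-8",
  messages: [
    { role: "system", content: "Extract the invoice as JSON with keys vendor, total, and lineItems." },
    { role: "user", content: rawInvoiceText },
  ],
  response_format: { type: "json_object" }, // ask the model to emit JSON
});

// Parse and validate; this throws if the output does not match the schema.
const invoice = Invoice.parse(JSON.parse(res.choices[0].message.content ?? "{}"));
console.log(invoice.vendor, invoice.total);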
2 MINUTES TO INTEGRATE
We designed our API from scratch to make integration as easy as possible; a full integration takes only two minutes. Switch today. Satisfaction guaranteed.
END-TO-END GENERATIVE AI
We power the most comprehensive generative AI workflows for your application without missing a beat. Get tokens to your users at blazing-fast speeds.
AFFORDABLE AT EVERY SCALE
We built custom AI-native orchestration and scheduling software to ensure you always get the best prices without compromising on performance.
TO INFINITY AND BEYOND
We regularly update our model catalog when new models are released, so you always have access to the latest and greatest AI models.