GroqCloud is a high-performance AI inference platform built to deliver ultra-low latency, predictable cost, and production-grade reliability for real-world applications.
Groq delivers high token throughput and fast response times across leading open models, enabling real-time chat, document analysis, and interactive user experiences. Customers report transformative speedups that unlock workloads previously constrained by latency.
Run GPT‑OSS (20B/120B), Llama 4 (Scout, Maverick), Llama 3.3 70B, Llama 3.1 8B, Qwen3 32B (131k context), Kimi K2, Whisper ASR, and Canopy Orpheus TTS. Linear, per‑unit pricing (tokens, characters, hours) makes costs transparent and easy to plan.
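A first request can be a minimal sketch pointing the OpenAI Python client at Groq's OpenAI-compatible endpoint; the model ID below is illustrative, so check the Developer Console for exact IDs:

```python
# Minimal sketch: a first request against Groq's OpenAI-compatible endpoint.
# The model ID "openai/gpt-oss-20b" is illustrative; verify IDs in the console.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # Groq's OpenAI-compatible endpoint
    api_key="YOUR_GROQ_API_KEY",
)

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Summarize the key risks in this clause: ..."}],
)
print(resp.choices[0].message.content)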
Automatic prompt caching on GPT‑OSS models gives a 50% discount on cached input tokens and reduces latency—no configuration required. Cached tokens also don’t count toward GroqCloud rate limits, stretching your quotas further.
Groq's Compound Systems intelligently select tools (web search, browser automation, code execution) to answer queries, with pass-through pricing. Built-in tools are billed per 1,000 requests or per hour, enabling robust, tool-using agents without extra infrastructure.
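A hedged sketch of a Compound request, assuming the standard chat-completions interface; the model ID "groq/compound" is an assumption, so consult the model list for the exact name:

```python
# Sketch: querying a Compound system, which decides on its own when to invoke
# built-in tools (web search, code execution) before answering.
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_GROQ_API_KEY")

resp = client.chat.completions.create(
    model="groq/compound",  # assumed ID for a Compound system
    messages=[{"role": "user", "content": "What changed in the latest Llama release?"}],
)
print(resp.choices[0].message.content)  # answer grounded in whatever tools it chose to run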
Drop-in compatibility with the OpenAI Responses API and Anthropic's Model Context Protocol lets you connect external tools (GitHub, browsers, databases) with zero code changes. Bring your own MCP servers and keys; third-party fees are billed by their providers.
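A sketch of attaching a remote MCP server through the Responses API; the tool schema below mirrors the OpenAI Responses MCP format, and the server URL, label, token, and model ID are placeholders rather than verified values:

```python
# Sketch: remote MCP via the OpenAI-compatible Responses API.
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_GROQ_API_KEY")

resp = client.responses.create(
    model="openai/gpt-oss-120b",  # illustrative model ID
    input="List my open GitHub issues.",
    tools=[{
        "type": "mcp",
        "server_label": "github",
        "server_url": "https://example.com/mcp/",           # your MCP server (placeholder)
        "headers": {"Authorization": "Bearer YOUR_TOKEN"},  # bring your own key
    }],
)
print(resp.output_text)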
Pricing is linear, with no hidden fees and no variable pricing. Prompt caching currently applies to GPT-OSS models (50% discount on cached input tokens; cache hits don't count toward rate limits). Built-in tools are billed per 1,000 requests or per hour as listed. ASR is billed per hour of audio with a 10-second minimum per request. The Batch API offers roughly 50% lower cost and does not consume standard rate limits. All prices are in USD.
We saw a drastic jump in performance and latency—everything became much faster. That time savings is transformative for our customers.
Prompt caching will be game‑changing for both speed and quality. With high prompt reuse, it accelerates our product and unlocks new use cases.
Using Groq, we reduced costs by 15x and transcription is 20x faster, lifting conversions by 30% and cutting churn in half.
Groq Compound unlocked our knowledge graph—entity extraction and linking are fast and 10x cheaper than Google’s grounded search.
JSON output, tool use, and consistent service in sensitive environments are crucial. GroqCloud gives us the flexibility to meet strict requirements.
Yes. You can sign up and try Groq for free in the Developer Console and Playground, then call models via an OpenAI‑compatible Responses API. Prompt caching works automatically—no configuration required.
Pricing is linear and usage‑based. LLMs are billed per million input/output tokens by model (e.g., GPT‑OSS‑20B $0.075/$0.30; GPT‑OSS‑120B $0.15/$0.60; Llama 4 Scout $0.11/$0.34; Qwen3 32B $0.29/$0.59). ASR (Whisper) is per hour with a 10s minimum; TTS (Orpheus) is per million characters. Built‑in tools (search, visit website, code, browser automation) are billed per 1,000 requests or per hour.
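As a back-of-the-envelope sketch using the GPT-OSS-120B prices listed above (the token counts are made up for illustration):

```python
# Worked cost example at $0.15 / $0.60 per 1M input / output tokens (GPT-OSS-120B).
input_tokens, output_tokens = 8_000, 1_000
input_price, output_price = 0.15, 0.60  # USD per 1M tokens

cost = (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price
print(f"${cost:.6f}")  # $0.001800

# If 6,000 of the input tokens hit the prompt cache (50% discount on cached input):
cached = 6_000
fresh = input_tokens - cached
cost_cached = ((fresh / 1e6) * input_price
               + (cached / 1e6) * input_price * 0.5
               + (output_tokens / 1e6) * output_price)
print(f"${cost_cached:.6f}")  # $0.001350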
Yes. You can get started for free on GroqCloud and upgrade to a paid tier as your needs grow. Paid tiers let you scale beyond the standard rate limits.
Groq supports GPT-OSS (20B/120B), Llama 4 Scout and Maverick, Llama 3.3 70B, Llama 3.1 8B, Qwen3 32B (131k context), Kimi K2 0905, Whisper Large V3/V3 Turbo (ASR), and Canopy Orpheus (TTS). Additional and fine-tuned models are available on enterprise request.
Prompt caching detects identical prompt prefixes and reuses computation, reducing latency and giving a 50% discount on cached input tokens for GPT‑OSS models. There’s no extra fee for caching; discounts apply only on cache hits. Cached tokens also don’t count toward GroqCloud rate limits.
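Because matching is prefix-based, it helps to put static content (system prompt, policy text) first and variable content last. A minimal sketch, assuming an illustrative model ID and placeholder documents:

```python
# Sketch: caching matches identical prompt prefixes, so the second call below
# can reuse the cached system-prompt prefix from the first.
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_GROQ_API_KEY")

POLICY_TEXT = "...long, static review policy..."  # placeholder reusable prefix
SYSTEM = "You are a contract analyst. Apply this policy:\n" + POLICY_TEXT

for doc in ["text of clause A", "text of clause B"]:
    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",                # caching applies to GPT-OSS models
        messages=[
            {"role": "system", "content": SYSTEM},  # identical prefix -> cache hit
            {"role": "user", "content": doc},       # variable content last
        ],
    )
    print(resp.choices[0].message.content)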
Yes. Groq offers built‑in tools and Compound Systems for search, browsing, and code execution. Remote MCP (beta) lets you connect any MCP server (e.g., GitHub, browsers, databases) through the OpenAI‑compatible Responses API. You pay Groq for model tokens; third‑party MCP fees are billed by the provider.
Yes. The Batch API processes large asynchronous workloads at ~50% lower cost with a 24‑hour to 7‑day processing window and no impact on standard rate limits.
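A hedged sketch of the flow, assuming the Batch API mirrors the familiar OpenAI batch interface (upload a JSONL file of requests, then create a batch); the file name and window are placeholders:

```python
# Sketch: submit an asynchronous batch of requests.
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_GROQ_API_KEY")

batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),  # one JSON request object per line
    purpose="batch",
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # processing windows range from 24 hours up to 7 days
)
print(batch.id, batch.status)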
Yes. For enterprise API solutions and on‑prem deployments, contact Groq via the Enterprise Access page. Enterprise also offers access to additional models, dedicated support, and custom SLAs.
Join thousands of developers who are already using Groq to enhance their workflow and productivity.