
OpenAI API Rate Limits: Tiers and How to Increase Them

Learn how OpenAI API rate limits work, what usage tiers mean, how to check your limits, and how to raise throughput without causing 429 errors.

[Image: API limits dashboard with RPM, TPM, and MONTHLY gauges beside a six-step tier ladder.]

OpenAI API rate limits control how many requests, tokens, images, and batch tokens your organization can use over a given period. They are not one universal number. Limits vary by organization, project, usage tier, model, endpoint, and sometimes model family. The practical path is simple: check your Limits page, design around requests per minute and tokens per minute, handle 429 errors with backoff, and request a higher tier only after your production traffic shows a clear need. OpenAI says API rate limits are measured across RPM, RPD, TPM, TPD, and IPM, and usage tiers generally rise automatically as API spend increases.[1]

What OpenAI API rate limits measure

OpenAI API rate limits are caps on how much API traffic an organization or project can send. OpenAI defines them at the organization and project level, not at the individual user level, and says they vary by model.[1] That distinction matters when you have several apps sharing one organization. A staging script, a batch job, and a production service can all compete for the same model pool if you do not isolate them with separate projects and budgets.

The main units are requests per minute, requests per day, tokens per minute, tokens per day, and images per minute. OpenAI abbreviates those as RPM, RPD, TPM, TPD, and IPM.[1] You can hit a limit through any one of those dimensions. A service that sends many tiny prompts may hit RPM first. A service that sends fewer long prompts may hit TPM first.
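
A quick back-of-the-envelope check can show which dimension your own traffic will hit first. The sketch below uses made-up limit values for illustration; substitute the numbers from your Limits page.

# Illustrative planning math; replace the limits with the values on your Limits page.
requests_per_minute = 600             # planned request volume
avg_tokens_per_request = 150          # prompt plus completion estimate
tpm_needed = requests_per_minute * avg_tokens_per_request   # 90,000 tokens per minute
rpm_limit, tpm_limit = 500, 200_000   # hypothetical limits for one model at one tier
bottleneck = "RPM" if requests_per_minute / rpm_limit >= tpm_needed / tpm_limit else "TPM"
print(bottleneck)                     # "RPM": many small calls hit the request cap first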

Rate limits are different from model pricing. Pricing determines what you pay for usage. Rate limits determine how quickly you can use a model. If you are planning both cost and capacity, pair this article with our OpenAI API pricing guide and the OpenAI API cost calculator.

[Image: Five meter dials labeled RPM, RPD, TPM, TPD, and IPM with different needle positions.]

Rate limits and usage limits are not the same

A rate limit controls short-window throughput. A usage limit controls monthly spend. OpenAI’s rate-limit documentation says organizations also have monthly API spending limits, known as usage limits.[1] You can be blocked by either one. A 429 error can mean you sent requests too quickly, but OpenAI’s error-code guide also lists a 429 quota error when you run out of credits or hit maximum monthly spend.[6]

Treat these as two dashboards. For throughput, watch RPM, TPM, reset headers, and queue depth. For billing, watch monthly spend, soft alerts, and hard caps. Our OpenAI API best practices for production article covers the broader operating checklist.

OpenAI API usage tiers

OpenAI’s public rate-limit guide lists the usage-tier qualifications and monthly usage limits below. These tiers do not guarantee the same RPM and TPM for every model. They define your account’s general level, and the model pages or Limits page show the exact caps that apply to each model.[1]

Tier | Qualification | Monthly usage limit | What it usually means
Free | User must be in an allowed geography | $100 / month[1] | Testing and evaluation, if the model supports free access
Tier 1 | $5 paid | $100 / month[1] | Small prototypes and early paid experiments
Tier 2 | $50 paid and 7+ days since first successful payment | $500 / month[1] | Low-volume production apps
Tier 3 | $100 paid and 7+ days since first successful payment | $1,000 / month[1] | Growing production apps with regular traffic
Tier 4 | $250 paid and 14+ days since first successful payment | $5,000 / month[1] | Higher-throughput services with stronger traffic history
Tier 5 | $1,000 paid and 30+ days since first successful payment | $200,000 / month[1] | Large production workloads and enterprise-scale usage

OpenAI also says that, as API spend goes up, organizations are automatically graduated to the next usage tier, which usually increases rate limits across most models.[1] The word “usually” is important. Some models, endpoints, or preview features can still have separate availability rules or tighter limits.

[Image: Six ascending tier cards labeled FREE, T1 $100/MO, T2 $500/MO, T3 $1K/MO, T4 $5K/MO, and T5 $200K/MO.]

Usage tiers are not ChatGPT subscription tiers

Do not confuse API usage tiers with ChatGPT Free, Plus, Pro, Team, Business, Enterprise, or Edu plans. API billing and ChatGPT subscriptions are separate products. If you are trying to decide whether you need a subscription or API billing, start with ChatGPT API vs ChatGPT Plus and Does ChatGPT Plus include API access?.

Why model limits differ by tier

A tier is only the starting point. OpenAI states that rate limits vary by model and that some model families share limits.[1] This is why two apps in the same organization can see different throughput. One app may call a low-latency model with generous TPM. Another may call a heavier reasoning model with a smaller pool.

OpenAI’s GPT-5 model page shows this pattern clearly. For GPT-5, Free is listed as not supported, while Tier 1 is listed at 500 RPM, 500,000 TPM, and a 1,500,000-token Batch queue limit; Tier 5 is listed at 15,000 RPM, 40,000,000 TPM, and a 15,000,000,000-token Batch queue limit.[8] Those numbers are model-specific. Do not copy them to another model without checking that model’s page or your Limits dashboard.

Limit type | What it caps | Common bottleneck | Common fix
RPM | Requests sent per minute | Many short calls | Batch small tasks into fewer requests
TPM | Input and output token volume per minute | Long prompts or long completions | Shorten prompts, cap output, or spread traffic
RPD / TPD | Daily request or token volume | Large daily jobs | Move non-urgent work to Batch API
IPM | Image requests per minute | Concurrent image generation | Queue image jobs and limit parallelism
Batch queue | Queued input tokens for batch jobs | Large offline processing runs | Chunk files and submit staged batches

Shared limits deserve special attention. When several model aliases or snapshots share a rate limit, calls to any of them draw from the same pool.[1] That can surprise teams that split traffic across model names and still hit one combined bottleneck.

How to check your current limits

The most reliable place to check your limits is the Limits section of your OpenAI account settings. OpenAI’s Help Center says all API usage is subject to rate limits and directs users to the settings Limits page to apply for an increase.[2] OpenAI’s rate-limit guide also says you can view rate and usage limits for your organization under the Limits section of account settings.[1]

  • Open the OpenAI Platform dashboard.
  • Select the correct organization and project.
  • Open the Limits page.
  • Check the model you plan to use.
  • Record RPM, TPM, daily limits, and batch queue limits where shown.
  • Confirm whether the model has a shared limit with related models.

If your account belongs to multiple organizations, confirm the default organization before testing. OpenAI’s production best-practices guide says usage counts against the organization specified for the API request, and if no organization header is provided, the default organization is billed.[3] A common mistake is testing in one organization and deploying under another.
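
If you use the official Python SDK, one way to avoid that mistake is to pin the organization and project explicitly when constructing the client. The sketch below is a minimal example; the IDs are placeholders.

from openai import OpenAI

# Pin the billing context explicitly so usage counts against the organization
# and project you tested with, not whatever default the API key falls back to.
client = OpenAI(
    organization="org-your-org-id",    # placeholder organization ID
    project="proj_your_project_id",    # placeholder project ID
)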

[Image: Response header panel with rows labeled LIMIT, REMAIN, and RESET, plus progress bars and a clock.]

Use response headers for live throttling

OpenAI documents rate-limit response headers for the current limit, remaining capacity, and reset timing. The published header fields include x-ratelimit-limit-requests, x-ratelimit-limit-tokens, x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, x-ratelimit-reset-requests, and x-ratelimit-reset-tokens.[1] Your application should log these headers for production traffic. They are more useful than guessing from averages.

Use headers to slow down before failures. If remaining requests or remaining tokens approach zero, pause low-priority work. If the reset value is short, wait and retry. If the reset value is long, move work to a queue rather than tying up web requests.
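
Here is a minimal sketch of that policy with the official Python SDK, using the with_raw_response wrapper to expose the headers alongside the parsed result. The pause threshold and the pause_low_priority_queue hook are assumptions about your own scheduler.

from openai import OpenAI

client = OpenAI()

# Fetch the raw HTTP response so the rate-limit headers are available.
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
completion = raw.parse()  # the normal parsed completion object

remaining_requests = int(raw.headers.get("x-ratelimit-remaining-requests", 0))
remaining_tokens = int(raw.headers.get("x-ratelimit-remaining-tokens", 0))

# Example policy: pause low-priority work when headroom gets thin.
if remaining_requests < 20 or remaining_tokens < 5_000:
    pause_low_priority_queue()  # hypothetical hook into your own scheduler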

[Chart: remaining capacity falls from 100 to 10, then refills above a pause threshold of 20.]

How to increase OpenAI API rate limits

There are two main ways to increase OpenAI API rate limits: automatic tier graduation and a manual increase request. OpenAI says API spend can automatically move an organization to the next usage tier, and that this usually raises rate limits across most models.[4] The Help Center also says users who want higher limits can use the Limits page in settings and apply for an increase at the bottom of the page.[2]

Before you request an increase

Do a capacity audit first. OpenAI has not published an official approval time for manual rate-limit increases, so you should assume the request needs clear evidence. The best request explains the production use case, the model, expected traffic, current limits, peak RPM, peak TPM, monthly spend, and what mitigation you already implemented.

  • Identify the exact model and endpoint where the bottleneck occurs.
  • Export recent traffic logs showing real demand (see the sketch after this list).
  • Show whether you hit RPM, TPM, daily limits, or monthly spend.
  • Confirm that retries use backoff and jitter.
  • Explain why batching, streaming, or smaller models do not solve the bottleneck.
  • Set a safe monthly budget before raising throughput.
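
One way to pull peak RPM and TPM out of a request log is sketched below. It assumes a CSV export with a timestamp column and a total_tokens column for each call; the file name and column names are assumptions about your own logging.

import pandas as pd

# One row per API call; total_tokens = prompt plus completion tokens for that call.
log = pd.read_csv("api_requests.csv", parse_dates=["timestamp"])
per_minute = log.set_index("timestamp").resample("1min")

peak_rpm = per_minute.size().max()                   # busiest minute by request count
peak_tpm = per_minute["total_tokens"].sum().max()    # busiest minute by token volume
print(f"peak RPM: {peak_rpm}, peak TPM: {peak_tpm}")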

If the issue is cost rather than throughput, increasing rate limits will not fix it. Use the OpenAI API cost calculator and then decide whether to reduce token volume, change models, cache outputs, or use Batch API for offline work.

Manual increase request template

Use a concise request. The goal is to make review easy.

Organization: [org name and ID]
Project: [project name]
Model and endpoint: [model] via [Responses API / Chat Completions / Batch / other]
Current bottleneck: [RPM / TPM / daily / batch queue / monthly usage]
Current limit: [value from Limits page]
Requested limit: [target value]
Use case: [short production description]
Traffic evidence: [peak and average usage from logs]
Mitigations already implemented: [queueing, backoff, batching, output caps]
Safety controls: [user quotas, abuse monitoring, budget cap]
Business impact: [what fails if the limit stays unchanged]

Keep the request specific. “We need more tokens” is weak. “Our production support classifier hits TPM on this model during weekday peaks even after queueing and shorter prompts” is stronger.

How to avoid 429 rate-limit errors

A 429 response can mean you are sending requests too quickly or that you exceeded quota. OpenAI’s error-code guide distinguishes “Rate limit reached for requests” from “You exceeded your current quota, please check your plan and billing details.”[6] Handle these cases differently. Throughput errors need pacing. Quota errors need billing, credits, or usage-limit changes.

OpenAI’s Help Center recommends exponential backoff for “Too Many Requests” errors and notes that unsuccessful requests still contribute to the per-minute limit.[5] That last point is the reason tight retry loops fail. A service that retries instantly can make the problem worse.

[Diagram: retry timeline labeled REQUEST, 429, WAIT, RETRY, and SUCCESS, with a delay arc at WAIT.]

A practical retry policy

Use a retry policy that reads reset headers when available, adds jitter, and gives up after a bounded number of attempts. Do not retry every 429 forever. Send the failed job to a queue if it is safe to process later. Return a graceful message if the user needs an immediate answer.

import random
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def call_with_retry(make_request, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return make_request(client)
        except RateLimitError as error:
            # Quota-style 429s need billing or usage-limit changes, not retries.
            if "insufficient_quota" in str(error):  # heuristic check on the error body
                alert_billing_owner()  # your own notification hook
                raise
            # Throughput 429s: back off with jitter, or sleep until the reset header value.
            time.sleep(min(2 ** attempt + random.random(), 60))
    enqueue_for_later(make_request)  # your own queue for work that can run later

For a broader list of failures, see our OpenAI API errors breakdown. Rate-limit handling should live beside authentication, timeout, validation, and server-error handling. It should not be a one-off patch in a single endpoint.

Reduce the token estimate before you scale

OpenAI’s rate-limit guide says a request’s rate-limit calculation uses the maximum of max_tokens and an estimated token count based on request character count, and recommends setting max_tokens close to the expected response size.[1] This is one of the easiest fixes. If you reserve far more output than you need, you can reduce effective throughput even when actual completions are short.
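
A minimal Chat Completions sketch of that advice is below; the 300-token cap is just an example for a short classification task.

from openai import OpenAI

client = OpenAI()

# Capping max_tokens near the expected output keeps the rate-limit estimate
# close to what the request will actually consume.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Classify this ticket: 'refund not received'"}],
    max_tokens=300,  # the expected reply is short, so do not reserve thousands of tokens
)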

[Chart: the request’s actual token estimate stays flat near 300 while the rate-limit estimate follows max_tokens, rising from 300 to 1,600.]

Structured responses can also help because they make output length more predictable. If you need machine-readable JSON, review structured outputs with the OpenAI API and function calling in the OpenAI API.

Production design patterns for higher throughput

The best way to raise effective throughput is not always a higher tier. Often it is better routing. Split traffic by urgency, model, and latency tolerance. Keep user-facing requests fast. Move offline work out of the synchronous path. Add per-customer quotas so one customer cannot consume the whole organization’s pool.

[Diagram: Ingress (all requests) → Classify (urgency model) → Route (sync/async) → Throttle (quotas) → Execute (model or batch) → Observe (logs).]
Pattern | Use it when | Limit it helps | Tradeoff
Queue and worker pool | Traffic arrives in bursts | RPM and TPM | Adds operational complexity
Batch multiple small tasks | Each task is short and independent | RPM | Responses return together, not individually
Use Batch API | Work does not need an immediate response | Standard synchronous limits | Results can take up to 24 hours[7]
Stream responses | Users wait on long outputs | Perceived latency | Does not remove TPM constraints
Cap output length | Completions are longer than needed | TPM | May require better prompts or schemas
Use smaller models for simple tasks | Classification or extraction is routine | Cost and sometimes throughput | Requires evaluation before switching

OpenAI says the Batch API has a separate pool of significantly higher rate limits, offers 50% lower costs, and has a 24-hour turnaround time.[7] That makes it a good fit for evaluations, content classification, embeddings jobs, and large offline transformations. Our OpenAI Batch API guide explains when it is worth the latency tradeoff.
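
A minimal sketch of submitting offline work through the Batch API with the Python SDK is below. The JSONL file name is an assumption, and each line in it must already follow the documented batch input format.

from openai import OpenAI

client = OpenAI()

# Upload a JSONL file where each line is one request in the batch input format.
batch_file = client.files.create(
    file=open("nightly_classification.jsonl", "rb"),
    purpose="batch",
)

# Queue the job; results arrive within the 24-hour completion window.
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(job.id, job.status)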

Streaming can improve user experience, but it is not a rate-limit bypass. Use streaming responses with the OpenAI API when users need early tokens on screen. Use the Responses API when you want a modern endpoint for multimodal inputs, tools, structured outputs, and streaming in one workflow.

Finally, add internal guardrails. OpenAI’s rate-limit guide recommends caution with programmatic access, bulk processing, and automated social posting, and suggests setting usage limits for individual users over daily, weekly, or monthly windows.[1] That advice applies even if your organization has a high tier. A high global limit without per-user controls can still fail under abuse or a runaway job.
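
A minimal in-memory sketch of a per-user daily cap is below; a production service would back this with a shared store such as Redis, and the budget figure is an arbitrary example.

from collections import defaultdict
from datetime import date

DAILY_TOKEN_BUDGET = 50_000              # example per-user cap; tune to your product
usage = defaultdict(int)                 # (user_id, day) -> tokens used so far

def check_and_record(user_id, tokens_requested):
    key = (user_id, date.today())
    if usage[key] + tokens_requested > DAILY_TOKEN_BUDGET:
        return False                     # reject or defer before touching the API
    usage[key] += tokens_requested
    return True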

Frequently asked questions

Where do I find my OpenAI API rate limits?

Open the OpenAI Platform dashboard, select the correct organization and project, and check the Limits page. OpenAI’s Help Center says users can check rate limits there and apply for an increase at the bottom of that page.[2] Also check response headers in production logs because they show remaining capacity and reset timing.

What is the difference between RPM and TPM?

RPM is requests per minute. TPM is tokens per minute. You can hit RPM with many small calls or TPM with fewer large calls. OpenAI’s rate-limit guide says a limit can be triggered on any of the measured dimensions, whichever threshold you reach first.[1]

How do I increase OpenAI API rate limits?

OpenAI says usage tiers generally increase automatically as API spend grows, and that users can review the Usage Limits page for information on advancing to the next tier.[4] You can also apply for an increase from the Limits page in account settings.[2] Include the model, endpoint, current bottleneck, traffic evidence, and safety controls in the request.

Does a higher usage tier increase limits for every model?

Not necessarily. OpenAI says rate limits vary by model and that some model families share rate limits.[1] Always check the specific model page or your Limits dashboard. Do not assume a higher tier gives identical RPM or TPM across all models.

Why am I getting 429 errors if I am below my per-minute limit?

OpenAI says rate limits can be quantized and enforced over shorter periods, so short bursts can trigger errors even if the full-minute average looks safe.[5] Your retry loop may also be contributing to the limit because unsuccessful requests count against per-minute capacity.[5] Add pacing, jitter, and queueing instead of retrying immediately.

[Chart: burst traffic exceeds the short-window cap while the full-minute average stays below it.]

Can the Batch API help with rate limits?

Yes, if the work is not time-sensitive. OpenAI describes the Batch API as using a separate pool of significantly higher rate limits, with 50% lower costs and a 24-hour turnaround time.[7] It is a poor fit for live chat, but a strong fit for evaluations, classification, and large offline processing.

Editorial independence. chatai.guide is reader-supported and not affiliated with OpenAI. We don’t accept paid placements or sponsored reviews — every recommendation reflects our own testing.