How to Integrate GPT-4, Claude, and Gemini APIs Into Your App: A Practical Developer Guide for 2026

ScribePilot AI

The "pick one LLM and call it done" strategy is naive. Production AI applications in 2026 need to integrate GPT-4, Claude, and Gemini APIs together, routing intelligently across providers based on cost, capability, and availability. Models go down. Pricing changes. A task that GPT-4o handles brilliantly might be overkill when Claude Haiku does the same job at a fraction of the cost. The applications that survive aren't loyal to one provider. They're model-agnostic by design.

This guide walks through exactly that: setting up all three SDKs, making equivalent calls side-by-side, building a thin abstraction layer with streaming and retry logic, and deploying it in a way that won't embarrass you in production.


1. Why Multi-Model Architecture Matters Now

Three drivers have made single-model apps fragile:

Resilience. Any API can go down. If your entire product depends on one provider's uptime, your product's uptime is their uptime.

Cost optimization. Not every task needs the most powerful model. Routing a simple classification task to a cheaper model while reserving expensive capacity for complex reasoning is straightforward to implement and meaningfully affects costs at scale.

Task-specific performance. No model is universally best. Benchmarks shift with every release, and performance varies dramatically by task type. A model-agnostic architecture lets you swap backends without rewriting business logic.

Open-source libraries like LiteLLM (v2.0+), LangChain (v0.2+), and the Vercel AI SDK (v3.0+) have made this pattern mainstream, according to their official documentation as of April 2026. But understanding what's underneath those abstractions matters when things break.


2. Environment Setup

Install all three SDKs in one shot (versions current as of PyPI, April 2026):

pip install openai==1.37.0 anthropic==0.25.0 google-generativeai==0.5.0

Store your keys as environment variables. Never hardcode them.

# .env file
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...

Then load them into the process environment at startup:

# app startup
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env into os.environ

Authentication patterns differ slightly across providers. OpenAI and Anthropic use API keys exclusively. Google's Gemini API also supports service accounts for more robust application-level authentication, which is worth considering for anything running in GCP, according to Google AI Developer Documentation (April 2026).

Use a secrets manager (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault) in production. Environment variables in a .env file are fine for local dev, not for production deployments.
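Whichever source the keys come from, failing fast at startup beats a confusing 401 deep inside a request. A minimal sketch (the helper name `require_env` is our own, not from any SDK):

```python
import os

def require_env(*names: str) -> dict:
    """Return the requested environment variables, failing fast if any are missing."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return {n: os.environ[n] for n in names}
```

Call it once at startup, e.g. `require_env("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")`, so a misconfigured deployment fails immediately with a clear message.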


3. Side-by-Side: Basic Chat Completion

Here's the same prompt sent to all three APIs. Notice where the interfaces converge and where they diverge.

import openai
import anthropic
import google.generativeai as genai

# OpenAI GPT-4o
def call_openai(prompt: str) -> str:
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return response.choices[0].message.content

# Anthropic Claude 3 Sonnet
def call_claude(prompt: str) -> str:
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# Google Gemini 1.5 Pro
def call_gemini(prompt: str) -> str:
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-pro-latest")
    response = model.generate_content(prompt)
    return response.text

The OpenAI and Anthropic interfaces are close enough that switching between them is mostly a matter of field names. Gemini's SDK is more divergent, especially around how the model is instantiated and how multimodal inputs are passed.

Token usage lives in different places too. OpenAI puts it in response.usage.prompt_tokens and response.usage.completion_tokens. Anthropic uses response.usage.input_tokens and response.usage.output_tokens. Gemini exposes it via response.usage_metadata.prompt_token_count and response.usage_metadata.candidates_token_count. This inconsistency is exactly why you need a normalization layer.
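A small normalizer keeps that provider-specific knowledge in one place. This sketch reads the attribute paths listed above via duck typing, so it works on the real SDK response objects or on plain stand-ins:

```python
def extract_token_counts(provider: str, resp) -> tuple:
    """Return (input_tokens, output_tokens) regardless of provider response shape."""
    if provider == "openai":
        return resp.usage.prompt_tokens, resp.usage.completion_tokens
    if provider == "anthropic":
        return resp.usage.input_tokens, resp.usage.output_tokens
    if provider == "gemini":
        meta = resp.usage_metadata
        return meta.prompt_token_count, meta.candidates_token_count
    raise ValueError(f"Unknown provider: {provider}")
```

If a provider renames a field in a future SDK release, this is the only function you touch.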


4. Building a Unified Abstraction Layer

A thin wrapper class normalizes inputs, outputs, token counts, error handling, and streaming. Here's a production-oriented version:

import time
import logging
from dataclasses import dataclass
from typing import Generator, Optional

logger = logging.getLogger(__name__)

@dataclass
class LLMResponse:
    text: str
    input_tokens: int
    output_tokens: int
    model: str

class LLMRouter:
    def __init__(self):
        self.openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        self.anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
        genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

    def complete(
        self,
        prompt: str,
        provider: str = "openai",
        model: Optional[str] = None,
        max_retries: int = 3,
    ) -> LLMResponse:
        defaults = {
            "openai": "gpt-4o",
            "anthropic": "claude-3-sonnet-20240229",
            "gemini": "gemini-1.5-pro-latest",
        }
        model = model or defaults[provider]

        for attempt in range(max_retries):
            try:
                return self._dispatch(prompt, provider, model)
            except Exception as e:
                if attempt == max_retries - 1:
                    raise
                wait = 2 ** attempt  # exponential backoff: 1s, 2s, 4s
                logger.warning(f"Attempt {attempt + 1} failed ({e}). Retrying in {wait}s.")
                time.sleep(wait)

    def _dispatch(self, prompt: str, provider: str, model: str) -> LLMResponse:
        if provider == "openai":
            resp = self.openai_client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=512,
            )
            return LLMResponse(
                text=resp.choices[0].message.content,
                input_tokens=resp.usage.prompt_tokens,
                output_tokens=resp.usage.completion_tokens,
                model=model,
            )

        elif provider == "anthropic":
            resp = self.anthropic_client.messages.create(
                model=model,
                max_tokens=512,
                messages=[{"role": "user", "content": prompt}],
            )
            return LLMResponse(
                text=resp.content[0].text,
                input_tokens=resp.usage.input_tokens,
                output_tokens=resp.usage.output_tokens,
                model=model,
            )

        elif provider == "gemini":
            gemini_model = genai.GenerativeModel(model)
            resp = gemini_model.generate_content(prompt)
            return LLMResponse(
                text=resp.text,
                input_tokens=resp.usage_metadata.prompt_token_count,
                output_tokens=resp.usage_metadata.candidates_token_count,
                model=model,
            )

        else:
            raise ValueError(f"Unknown provider: {provider}")

Streaming Support

Streaming is non-negotiable for user-facing chat interfaces. Each SDK exposes it differently:

def stream_complete(self, prompt: str, provider: str = "openai") -> Generator[str, None, None]:
    if provider == "openai":
        stream = self.openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield delta

    elif provider == "anthropic":
        with self.anthropic_client.messages.stream(
            model="claude-3-sonnet-20240229",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        ) as stream:
            for text in stream.text_stream:
                yield text

    elif provider == "gemini":
        gemini_model = genai.GenerativeModel("gemini-1.5-pro-latest")
        for chunk in gemini_model.generate_content(prompt, stream=True):
            if chunk.text:
                yield chunk.text

All three yield string chunks, so downstream consumers can treat them identically. That's the point.
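Because the router yields plain strings, downstream code can be written once against any provider. A sketch of a generic consumer that both displays chunks as they arrive and returns the assembled reply (the function name is our own):

```python
def consume_stream(chunks, on_chunk=print) -> str:
    """Feed each chunk to a display callback, then return the full assembled reply."""
    parts = []
    for chunk in chunks:
        on_chunk(chunk)   # e.g. flush to stdout or push over a websocket
        parts.append(chunk)
    return "".join(parts)
```

Usage: `full_reply = consume_stream(router.stream_complete(prompt, provider="gemini"))` works unchanged if you swap the provider.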

Content Filter Normalization

Content policy handling is genuinely inconsistent across providers and deserves careful treatment in your abstraction. OpenAI and Anthropic return structured error codes (like content_policy_violation) as exceptions or finish reasons you can catch explicitly. Gemini behaves differently: it may return an empty response.text with prompt_feedback.block_reason populated instead. If you don't check for this, you'll silently return empty strings to users with no indication that the content was blocked.

Normalize this in _dispatch by raising a consistent custom exception (ContentFilterException is your own Exception subclass, not something the SDKs provide):

elif provider == "gemini":
    resp = gemini_model.generate_content(prompt)
    if not resp.text:
        block_reason = getattr(resp.prompt_feedback, "block_reason", "UNKNOWN")
        raise ContentFilterException(f"Gemini blocked content: {block_reason}")

This isn't about circumventing safety features. It's about making failures visible and consistent so your application can respond appropriately, whether that's surfacing a user-facing message or logging for review.
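The exception itself can be a one-line subclass; giving it a provider field makes logs searchable. The shape here is our own convention, not an SDK type:

```python
class ContentFilterException(Exception):
    """Raised when any provider blocks a prompt or response for content policy reasons."""
    def __init__(self, message: str, provider: str = "unknown"):
        super().__init__(message)
        self.provider = provider
```

Catch it once at the application boundary and render a single, consistent "content blocked" response, regardless of which provider did the blocking.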


5. Practical Differences That Affect Architecture

Context windows. According to OpenAI, Anthropic, and Google AI API Documentation (April 2026): GPT-4 Turbo and GPT-4o offer 128k tokens; Claude 3 models provide 200k tokens, with up to 1 million available for specific use cases upon request; Gemini 1.5 Pro and Flash support 1 million tokens by default. For document-heavy workloads, Gemini or Claude wins on context alone.

Rate limits. Default limits in early 2026 are generally around 60 requests per minute and 1.5 million tokens per minute for standard API tiers, though these vary by model and account tier and can be increased through enterprise agreements, per Provider Rate Limit Documentation (April 2026). Build rate limit detection into your retry logic. A 429 response should trigger backoff, not a crash.
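A sketch of 429-aware backoff. Note the assumption: this checks a generic `status_code` attribute on the exception; in real code you would catch each SDK's specific rate-limit error type instead. The injectable `sleep` parameter exists so the behavior can be tested without waiting:

```python
import time

def with_backoff(fn, max_retries: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Retry fn() with exponential backoff on 429s; re-raise anything else."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            retryable = getattr(e, "status_code", None) == 429
            if not retryable or attempt == max_retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Non-429 errors (bad request, auth failure) re-raise immediately; retrying those just burns your quota.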

Multimodal capabilities. All three providers support image inputs as of 2026. Gemini's multimodal support is natively integrated into its content generation pipeline, which makes it straightforward to pass mixed text and image content. GPT-4o and Claude handle vision through the messages API with typed content blocks. If your app processes images heavily, test all three providers and choose based on actual output quality for your specific use case.
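The message shapes for image input differ as well. These builders construct the payload dicts only, no API call, following the documented content-block formats for OpenAI (image_url blocks) and Anthropic (base64 source blocks); treat the exact field names as something to verify against current provider docs before shipping:

```python
import base64

def openai_image_message(prompt: str, image_url: str) -> dict:
    """OpenAI: images go in a typed content list alongside the text."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

def anthropic_image_message(prompt: str, image_bytes: bytes,
                            media_type: str = "image/png") -> dict:
    """Anthropic: images are passed as base64-encoded source blocks."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": prompt},
        ],
    }
```

Gemini is the outlier: its SDK accepts mixed lists directly, e.g. `model.generate_content([prompt, pil_image])`, with no content-block dict to build.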


6. Cost and Routing Strategy

Per-token pricing as of April 2026 (OpenAI, Anthropic, and Google AI Pricing Pages):

| Provider | Model | Input (per 1k tokens) | Output (per 1k tokens) |
|---|---|---|---|
| OpenAI | GPT-4 Turbo | $0.01 | $0.03 |
| Anthropic | Claude 3 Opus | $0.015 | $0.075 |
| Anthropic | Claude 3 Sonnet | $0.0015 | $0.0075 |
| Google | Gemini 1.5 Pro | $0.0035 | $0.007 |

The cost spread is significant. A routing strategy that sends simple tasks to Claude Sonnet or Gemini Pro while reserving GPT-4 Turbo or Claude Opus for complex reasoning can meaningfully reduce costs at scale. A basic routing approach:

def route(self, prompt: str, task_complexity: str = "low") -> LLMResponse:
    token_estimate = len(prompt.split()) * 1.3  # rough estimate
    if task_complexity == "high" or token_estimate > 3000:
        return self.complete(prompt, provider="anthropic", model="claude-3-opus-20240229")
    elif task_complexity == "medium":
        return self.complete(prompt, provider="gemini")
    else:
        return self.complete(prompt, provider="anthropic", model="claude-3-sonnet-20240229")

Tune these thresholds against your actual workload. Token-count estimation is imprecise; use the tiktoken library for OpenAI-compatible counts as a closer proxy.
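A sketch of an estimator that uses tiktoken when it's installed and falls back to the word heuristic otherwise; the ~1.3 tokens-per-word ratio is a rough rule of thumb for English text, not a guarantee:

```python
def estimate_tokens(text: str, model: str = "gpt-4o") -> int:
    """Estimate token count: exact for OpenAI models via tiktoken, heuristic otherwise."""
    try:
        import tiktoken
        return len(tiktoken.encoding_for_model(model).encode(text))
    except Exception:  # tiktoken not installed, or model name unknown to it
        return int(len(text.split()) * 1.3)
```

For Claude and Gemini, token counts from this are only approximate; their own tokenizers differ, and the authoritative numbers are the usage fields each API returns.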


7. Production Considerations

Fallbacks. Wrap your primary provider call in a try/except that falls back to a secondary:

try:
    return self.complete(prompt, provider="openai")
except Exception as e:
    logger.error(f"OpenAI failed ({e}); falling back to Anthropic")
    return self.complete(prompt, provider="anthropic")

Token logging. Every LLMResponse object carries token counts. Log them. This is how you catch runaway prompts before they become runaway bills.

response = router.complete(prompt, provider="gemini")
logger.info(f"model={response.model} input={response.input_tokens} output={response.output_tokens}")

Usage policies. Each provider has distinct terms around data retention, training data opt-outs, and content usage. Enterprise tiers typically offer stronger privacy guarantees than consumer tiers. Read the policies for each provider before sending customer data through their APIs. This isn't optional.


Conclusion

Multi-model integration isn't a nice-to-have anymore. It's how production AI systems are built in 2026. The implementation isn't particularly complex once you accept that each SDK has its own quirks and design a normalization layer from the start rather than bolting one on later.

Start here:

  • Install all three SDKs, lock your versions, and keep your API keys out of source control
  • Build the LLMRouter class early, before you've committed to one provider's response format
  • Implement streaming and content filter normalization from the beginning
  • Add exponential backoff and fallback logic before you hit production
  • Log token counts on every call so you know what you're spending

The abstraction layer is the key investment. Once it's in place, swapping providers or adding routing rules is a configuration change, not a refactor.
