
Building One of the First ChatGPT Apps

15 Nov 2025 · 11 min read · 2,032 words

I spent the last few weeks building one of the first ChatGPT Apps on top of OpenAI’s Apps SDK and the Model Context Protocol. What came out of it is a production MCP server, a UI that runs inside ChatGPT’s iframe, and a much clearer sense of what it means to build an AI‑native app when the model is not the product, just the runtime.


The core of the project is an MCP server that exposes a set of tools and widgets that ChatGPT can call to search classes, fetch details, build training plans, and schedule workouts, all from inside a chat. The Apps SDK treats that server as the backend for a ChatGPT App, and renders our UI inside a sandboxed iframe whenever the model decides to use us.

At a high level, the flow is: ChatGPT decides when to call one of the tools, the MCP server handles the call and returns structured data plus a widget resource, and the widget renders inside ChatGPT’s sandboxed iframe.

The interesting part is not that the model can call tools (we’ve all seen function calling); it’s that those tools can now return structured UI and run code in the user’s browser, inside ChatGPT’s own chrome. That forced a bunch of design decisions that look a lot more like classic web performance tuning than “AI magic.”

MCP Server Architecture

I built the MCP server as a Next.js app running on Vercel with a thin protocol layer on top of the official SDK. The server is responsible for three things: implementing the MCP protocol, registering tools, and serving widget HTML.

The protocol layer wires up JSON‑RPC handlers, registers tools, and exposes resources via URIs like ui://widget/search.html. It doesn’t know anything about workouts or classes; it just routes requests and enforces types.

// lib/mcp/server.ts
import { Server } from "@modelcontextprotocol/sdk/server";
import { tools } from "./tools/registry";
import { getResource } from "./static-resources";

export function createMcpServer() {
  const server = new Server({
    name: "peloton-mcp",
    version: "1.0.0",
  });

  // Register every tool from the registry with its name, schema, and handler.
  tools.forEach((tool) => {
    server.tool(tool.name, tool.schema, tool.handler);
  });

  // Serve prebuilt widget HTML for resource URIs like ui://widget/search.html.
  server.resource("ui://widget/:name", async ({ params }) => {
    const html = await getResource(params.name);
    return {
      contentType: "text/html",
      body: html,
    };
  });

  return server;
}

This is the pattern that generalises: the MCP layer stays boring and standards‑compliant, and all the app‑specific logic lives in tools and helpers behind it.

The Tool System

The MCP server exposes tools grouped by what they touch: search and discovery tools that are safe to call without user context, and richer tools that read or update account‑specific state. Each tool declares its own metadata so the host can reason about risk and capabilities.

A typical tool definition looks like this:

// lib/mcp/tools/search.ts
import { z } from "zod";
import { defineTool } from "@modelcontextprotocol/sdk";
import { searchClasses } from "./utils/searchEngine";

export const searchTool = defineTool({
  name: "search_classes",
  description: "Search Peloton classes by query and filters.",
  inputSchema: z.object({
    query: z.string(),
    limit: z.number().min(1).max(25).default(10),
  }),
  security: {
    is_read_only: true,
    is_destructive: false,
    is_open_world: false,
    securitySchemes: ["noauth"],
  },
  async handler({ input }) {
    const results = await searchClasses(input.query, input.limit);
    return { classes: results };
  },
});

Write tools flip those flags and require a different security scheme before doing anything that touches user state, but the implementation shape stays the same. Keeping that metadata close to the code makes it obvious what can go wrong and where we need extra checks.
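For contrast, here’s a sketch of what a write tool might look like under the same pattern; the tool name, scheduler helper, and OAuth scheme are illustrative, not the actual implementation:

// lib/mcp/tools/schedule.ts (illustrative sketch, not the real tool)
import { z } from "zod";
import { defineTool } from "@modelcontextprotocol/sdk";
import { scheduleClass } from "./utils/scheduler"; // hypothetical helper

export const scheduleTool = defineTool({
  name: "schedule_workout",
  description: "Add a class to the user's workout schedule.",
  inputSchema: z.object({
    classId: z.string(),
    date: z.string(), // ISO date, e.g. "2025-11-20"
  }),
  security: {
    // Flipped flags: this tool writes account-specific state.
    is_read_only: false,
    is_destructive: false,
    is_open_world: false,
    securitySchemes: ["oauth"], // requires an authenticated user
  },
  async handler({ input }) {
    const entry = await scheduleClass(input.classId, input.date);
    return { scheduled: entry };
  },
});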

Building an External Search API That Could Take a Punch

The hardest part of this project wasn’t wiring up MCP or drawing boxes on an architecture diagram. It was building a search API that could sit in front of a large, rights‑constrained catalogue and survive being called by a global chat surface.

We needed to expose only the classes we were allowed to show externally, and we needed to make them discoverable in a way that felt natural in conversation: “20‑minute Cody ride”, “90s rock run”, “beginner yoga from last year”. That meant indexing enough metadata to answer those queries, and doing it without dragging a full database into every serverless function.

I ended up with a discipline‑sharded, enriched index that lives behind Vercel’s blob storage and edge cache. There are a dozen shards, one per fitness discipline, each containing compressed JSON with just enough data to power search: instructor names, music artists, duration, difficulty, air date, and a few lightweight category tags.
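Concretely, each shard entry carries roughly this shape; the field names below are my own sketch of the schema, not the exact production types:

// lib/search/types.ts (sketch; field names are illustrative)
type Class = {
  slug: string;            // stable identifier, also used for deduping
  title: string;
  discipline: string;      // e.g. "cycling", "running", "yoga"
  instructors: string[];
  artists: string[];       // music metadata behind queries like "90s rock"
  durationMinutes: number;
  difficulty: string;      // e.g. "beginner", "intermediate", "advanced"
  airDate: string;         // ISO date
  categories: string[];    // lightweight tags
  description?: string;
};

type Shard = {
  discipline: string;
  classes: Class[];
};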

At query time, the flow is simple:

type ParsedQuery = {
  disciplines: string[];       // e.g. ["cycling"]
  instructors: string[];       // recognized names
  artists: string[];           // recognized artists
  duration?: { min: number; max: number };
  difficulty?: string;
  year?: number;
  terms: string[];             // remaining tokens
};

async function searchClasses(raw: string, limit = 10) {
  const parsed = parseQuery(raw);            // extract filters & terms
  const shards = await loadShards(parsed);   // pull just the needed disciplines

  const scored: { cls: Class; score: number }[] = [];

  for (const shard of shards) {
    for (const cls of shard.classes) {
      if (!hardMatches(cls, parsed)) continue;

      const score = relevanceScore(cls, parsed);
      if (score <= 0) continue;

      scored.push({ cls, score });
    }
  }

  const deduped = dedupeBySlug(scored);
  deduped.sort((a, b) => b.score - a.score);

  return deduped.slice(0, Math.min(limit, 15)).map(x => x.cls);
}

The parser does the boring work: if a token matches a known instructor, it becomes an instructor filter; if it matches a known artist, it goes into the music filter; numbers and ranges map to duration or year; the rest become free‑text terms. The important bit is that everything downstream sees a structured query, not a bag of words.
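A stripped‑down sketch of that parser, assuming precomputed sets of known disciplines, instructor names, and artists (the real one handles multi‑word names and more formats), might look like:

// lib/search/parseQuery.ts (simplified sketch; known-name sets are illustrative)
const KNOWN_DISCIPLINES = new Set(["cycling", "running", "yoga" /* ... */]);
const KNOWN_INSTRUCTORS = new Set(["cody", "robin" /* ... */]);
const KNOWN_ARTISTS = new Set(["queen", "nirvana" /* ... */]);

function parseQuery(raw: string): ParsedQuery {
  const parsed: ParsedQuery = {
    disciplines: [],
    instructors: [],
    artists: [],
    terms: [],
  };

  for (const token of raw.toLowerCase().split(/\s+/)) {
    // Bare numbers and "20-minute"-style tokens map to year or duration.
    const numeric = token.match(/^(\d+)(?:-?min(?:ute)?s?)?$/);
    if (numeric) {
      const n = Number(numeric[1]);
      if (n >= 1990) parsed.year = n;
      else parsed.duration = { min: n, max: n };
      continue;
    }
    if (KNOWN_DISCIPLINES.has(token)) { parsed.disciplines.push(token); continue; }
    if (KNOWN_INSTRUCTORS.has(token)) { parsed.instructors.push(token); continue; }
    if (KNOWN_ARTISTS.has(token)) { parsed.artists.push(token); continue; }
    parsed.terms.push(token); // everything else stays as a free-text term
  }

  return parsed;
}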

Shard loading is discipline‑aware. If the query clearly targets “running”, we load just the running shard; if it’s ambiguous, we fan out to all twelve in parallel. Each shard is a gzipped JSON blob, sitting behind Vercel’s edge cache so the first request pays the cost and the rest are effectively local for that region.
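Shard loading itself is only a handful of lines; this sketch assumes one blob per discipline behind a base URL (SHARD_BASE_URL and the discipline list are placeholders):

// lib/search/loadShards.ts (sketch; URL layout is an assumption)
const SHARD_BASE_URL = "https://blob.example.com/shards"; // placeholder
const ALL_DISCIPLINES = ["cycling", "running", "yoga" /* ...twelve total */];

async function loadShards(parsed: ParsedQuery): Promise<Shard[]> {
  // Load only the shards the query clearly targets; otherwise fan out to all of them.
  const disciplines =
    parsed.disciplines.length > 0 ? parsed.disciplines : ALL_DISCIPLINES;

  return Promise.all(
    disciplines.map(async (d) => {
      // Each shard is a compressed JSON blob sitting behind Vercel's edge cache.
      const res = await fetch(`${SHARD_BASE_URL}/${d}.json`, { cache: "force-cache" });
      return (await res.json()) as Shard;
    }),
  );
}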

Filtering is strict where it needs to be. Discipline, duration, difficulty, and year are treated as hard filters. Search terms have to match on word boundaries in the title or categories; a loose substring match across every description would have made recall look better in a benchmark but worse in a real conversation. Once a class passes the filters, it gets a relevance score that leans heavily on the structured metadata: instructors are weighted highest, then artists, then title, with categories and description trailing behind.
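In code, hardMatches and relevanceScore are roughly the following; the weights are illustrative, but the ordering (instructors, then artists, then title, with categories and description trailing) follows the description above:

// lib/search/score.ts (sketch; weights are illustrative)
function hardMatches(cls: Class, q: ParsedQuery): boolean {
  if (q.duration && (cls.durationMinutes < q.duration.min || cls.durationMinutes > q.duration.max)) return false;
  if (q.difficulty && cls.difficulty !== q.difficulty) return false;
  if (q.year && new Date(cls.airDate).getFullYear() !== q.year) return false;

  // Free-text terms must match on word boundaries in the title or categories.
  const haystack = `${cls.title} ${cls.categories.join(" ")}`.toLowerCase();
  return q.terms.every((t) => new RegExp(`\\b${t}\\b`).test(haystack)); // regex escaping omitted for brevity
}

function relevanceScore(cls: Class, q: ParsedQuery): number {
  const instructors = cls.instructors.map((i) => i.toLowerCase());
  const artists = cls.artists.map((a) => a.toLowerCase());
  const title = cls.title.toLowerCase();

  let score = 0;
  score += q.instructors.filter((i) => instructors.some((x) => x.includes(i))).length * 10;
  score += q.artists.filter((a) => artists.some((x) => x.includes(a))).length * 6;
  score += q.terms.filter((t) => title.includes(t)).length * 3;
  score += q.terms.filter((t) => cls.categories.some((c) => c.toLowerCase().includes(t))).length;
  score += q.terms.filter((t) => (cls.description ?? "").toLowerCase().includes(t)).length * 0.5;
  return score;
}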

That balance gives the model a predictable surface. “20‑minute ride with [instructor]” will usually pull exactly what you’d expect; “power zone ride with 90s rock” will bias toward classes where both the coach and the music metadata line up. When the model chains tools (search first, plan later), it’s operating on a consistent set of results that were all evaluated against the same rules.

The nice thing about this setup is that it scales with traffic without getting clever. The index is precomputed, the blobs are cached at the edge, and the search path is mostly CPU work over in‑memory data. Vercel’s CDN and blob caching handle the heavy lifting of distribution, and the API code stays small enough to understand.

From a distance, it looks like an “AI search” feature. Up close, it’s just classic information retrieval tuned to sit behind a chat interface and answer whatever the model throws at it.

Rendering Widgets Inside ChatGPT

The UI runs entirely in the browser inside ChatGPT’s iframe, which means the page is loaded once and then hydrated repeatedly as the model calls tools and streams data into it. I started with Next.js and server‑rendered components, but every server round‑trip added 100-200 ms for HTML generation and hydration on top of tool latency and model thinking time.

So I ended up treating the UI as static HTML templates with a tiny client‑side runtime. At build time a script compiles Tailwind, bundles Preact, and generates a set of *.html files that the MCP server serves from memory.

// scripts/build-static-resources.tsx
import { writeFile } from "node:fs/promises";
import { build } from "esbuild";
import { renderToString } from "preact-render-to-string";
import { SearchWidget } from "../widgets/SearchWidget";

async function buildSearchWidget() {
  // Pre-render the widget shell to static HTML served by the MCP server.
  const html = renderToString(<SearchWidget />);
  await writeFile("public/widgets/search.html", html);
}

async function main() {
  await build({
    entryPoints: ["widgets/index.tsx"],
    bundle: true,
    outfile: "public/widgets/bundle.js",
    minify: true,
  });

  await buildSearchWidget();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

At runtime, the server just reads those files into memory on startup and hands them back when a resource is requested. The browser does the rest.

// lib/mcp/static-resources.ts
import fs from "node:fs";
import path from "node:path";

const cache = new Map<string, string>();

export async function getResource(name: string) {
  if (cache.has(name)) return cache.get(name)!;
  const html = await fs.promises.readFile(
    path.join(process.cwd(), "public/widgets", `${name}.html`),
    "utf8",
  );
  cache.set(name, html);
  return html;
}

That choice is pragmatic: SSR makes sense when you control the whole document and care about SEO; in an embedded chat widget, once the iframe is loaded, the limiting factor is tool latency and model streaming, not TTFB from your own server.


Why CSR Wins Here

Both ChatGPT and Claude currently run Apps and MCP tools inside tightly sandboxed iframes, with a bridge that talks JSON‑RPC over postMessage. Tool calls already sit behind multiple seconds of model latency, safety checks, and sandboxing; adding another server render on every interaction just to send a slightly different HTML shell is wasted time.

The pattern that performed best looked like this:

  1. Load a small Preact app once into the iframe.
  2. Let the MCP tool return structured content plus a resource reference.
  3. Hydrate the widget in place using the tool output that ChatGPT injects into window.openai (per OpenAI’s Apps SDK docs).

// in the widget JS bundle (entry: widgets/index.tsx)
import { hydrate } from "preact";
import { SearchWidget } from "./SearchWidget";

declare global {
  interface Window {
    openai?: {
      toolOutput?: unknown;
    };
    __WIDGET_CONFIG__?: unknown;
  }
}

const root = document.getElementById("root");
const initialData = window.openai?.toolOutput;
const config = window.__WIDGET_CONFIG__;

hydrate(<SearchWidget data={initialData} config={config} />, root!);

Once the shell is hot, each interaction is just MCP server time plus rendering in the iframe, usually on the order of tens or hundreds of milliseconds for the UI part. The difference between that and a server‑rendered setup isn’t huge in absolute terms, but it’s noticeable when the model is already slower than earlier, non‑reasoning variants.

Caching: Making Serverless Feel Stateful

All of this runs on Vercel’s serverless platform, which means every request might hit a different container. To keep that from feeling like a cold start every time, I leaned on Next.js’ data cache, Vercel’s edge CDN, and a small key‑value store for shared data.

The main caches:

  • A Next.js data cache for the sitemap and class metadata, with 24‑hour TTL.
  • A separate layer for class details with a shorter TTL, cached at both the edge and in serverless containers.
  • In‑memory caches inside each function for short‑lived things like shard manifests and search indexes.

The code pattern is simple and repeatable:

// lib/prospects/cache.ts
import { unstable_cache } from "next/cache";

async function fetchSitemap() {
  const res = await fetch("https://.../sitemap.json", {
    cache: "force-cache",
  });
  return res.json();
}

export const sitemapCache = unstable_cache(fetchSitemap, ["sitemap"], {
  revalidate: 60 * 60 * 24,
});

Every tool that needs class indexes calls sitemapCache() and builds its own lookups on top. That keeps the index consistent across search, fetch, training plan creation, and scheduling, even if classes are added, removed, or reordered. It’s mundane plumbing, but it’s the difference between “the model is wrong” and “the data is stale.”
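The per‑container in‑memory layer mentioned earlier is even simpler; here’s a minimal sketch (names and TTLs are mine, not the production values):

// lib/cache/memory.ts (sketch; names and TTL are illustrative)
type Entry<T> = { value: T; expiresAt: number };

const memory = new Map<string, Entry<unknown>>();

export async function memoized<T>(
  key: string,
  ttlMs: number,
  load: () => Promise<T>,
): Promise<T> {
  const hit = memory.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value as T;

  const value = await load();
  memory.set(key, { value, expiresAt: Date.now() + ttlMs });
  return value;
}

// e.g. cache a shard manifest for five minutes within a warm container:
// const manifest = await memoized("shard-manifest", 5 * 60_000, fetchManifest);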

Search Performance Tricks

Search is where most of the CPU goes. The index covers tens of thousands of classes, and prompts like “5k run, intervals run, endurance run” turn into multiple queries that hit the same shards. I treated that like any other search backend: parallelise what you can, deduplicate shared work, and bail out early when you’ve got enough good results.

The implementation leans on three ideas:

  • Start the sitemap fetch in parallel with the search.
  • Load each shard once per invocation and reuse it across queries.
  • Stop scoring once you have more high‑quality results than you need.

// lib/mcp/tools/search/handler.ts
export async function handleSearchMulti(queries: string[]) {
  // Start the sitemap fetch immediately so it overlaps with the searches below.
  const sitemapPromise = sitemapCache();

  const results = await Promise.all(
    queries.map((q) => enrichedSearch(q, sitemapPromise)),
  );

  return results.flat();
}

Inside enrichedSearch, there’s an early exit once the worst score in the current result set crosses a threshold:

const matches: ScoredClass[] = [];

for (let i = 0; i < candidates.length; i++) {
  const score = scoreCandidate(query, candidates[i]);
  if (score >= MIN_SCORE) {
    matches.push({ score, cls: candidates[i] });
  }

  // Every 100 candidates, once we have a healthy buffer, check whether the
  // current top `limit` results are already strong enough to stop early.
  if (matches.length >= limit * 3 && i % 100 === 0) {
    matches.sort((a, b) => b.score - a.score);
    const worst = matches[limit - 1];
    if (worst && worst.score >= 0.8) break;
  }
}

It’s not complicated, but it takes high‑match queries from hundreds of milliseconds down to something that feels instant from inside chat. The interesting part is that all of this is independent of the model; it’s just a search engine living next to an AI runtime.

Learnings

The main thing I learned is that “AI apps” are mostly regular distributed systems with a strange client. The model is a chatty, expensive, semi‑deterministic user that happens to sit between you and the human. The interesting engineering work is in making that user fast, predictable, and safe to talk to your APIs.

Some specific takeaways:

  • UI still matters in a chat box. If your widget feels slow, users will blame the model, not your hydration strategy.
  • Reasoning time dominates everything; when the model is thinking for seconds, shaving 150 ms off your tool isn’t about raw speed, it’s about perceived responsiveness.
  • Closed‑world tools are easier to reason about: keeping everything inside first‑party APIs with explicit security flags makes it much clearer what can go wrong.
  • Caching is still the main lever; the systems that feel “instant” are just the ones that avoid doing work twice.

None of this is new; it’s the same set of trade‑offs we make for any high‑traffic web app. The only difference is that the traffic here is mediated by a model that decides when to call you.

What I’m Interested In Now

Working on this pushed my interest further away from models themselves and toward the ecosystems around them: MCP hosts, app stores built into chat clients, and the economics of routing millions of queries through tools without falling over. I’m less interested in squeezing another point out of a benchmark and more interested in questions like: how do we measure latency when a turn involves multiple tool calls, UI updates, and model retries?

It also made “generative UI” feel less like a buzzword and more like a natural step: the model decides when to render a widget, what data to pass into it, and how to update it as the conversation evolves. Once you see that pattern, it’s hard to unsee: the chat window starts to look like an operating system, not just a text box.

Where I’m Headed

From here, I want to push on two fronts: better harnesses for testing these systems end‑to‑end, and better observability for AI‑driven flows. I care less about “prompt engineering” and more about being able to say, with confidence, where 400 ms went in a given turn and how often the model took a slow path.

Building one of the first ChatGPT Apps was a good reminder that the hard problems aren’t mystical. They look like latency, caching, UI, and system design: the same problems we’ve had for years, just wrapped around a different kind of runtime. I don’t know exactly what the next project will be, but I know it will sit somewhere in that intersection of models and systems, close to the metal and pointed at real users.