How I Used Codex To Build Better AI-Generated Websites

June 30, 2026
AI Engineering, Frontend, Business Automation

Most AI-generated websites have the same problem: they look like a model was asked to make something impressive before anyone decided what "good" meant.

You can usually see it immediately. There is a big gradient, a vague hero line, some cards, a few hover effects, a dashboard-looking section that does not belong to the business, and a testimonial block with no real point. The page may compile. It may even look decent in a screenshot. But it does not feel directed.

That is the gap I wanted to close.

I was not trying to prove that Codex could generate a website. That part is already obvious. I wanted to know whether I could make Codex produce better UI by changing the system around the model: the research, the rules, the boilerplate, the documentation, the working context, and the way implementation tasks were handed off.

The short version is this:

The prompt was not the breakthrough. The context was.

I used Codex as the model harness, but the work became interesting when I stopped treating the prompt like the whole product. The better results came from turning design research into an agent-ready operating system: a reusable foundation that gave Codex enough direction to build from without spending the whole session rediscovering the basics.

That is how I got from a researched agency site like Imprint to one-shot outputs like Ultimate Portfolio and Award Signal.

None of it is perfect. That is important to say clearly. The WebGL work especially has rough edges. But the difference between generic AI UI and agent output that actually has a design direction was obvious enough that I kept going.

This post walks through that process in practical terms: what changed, why it worked, where it still falls short, and how the same pattern applies to business automation.

The Three Examples

There are three examples behind this article.

The first is Imprint, an agency website concept. This came from a more deliberate process: basic research into award-winning websites in the same niche, a curated dataset of references, then a simple ruleset and documentation describing what the site should feel like, how it should be structured, and what kind of design language the agent should preserve.

The second is Ultimate Portfolio, a portfolio landing page concept. This came later, after I had a larger design dataset and a cleaner way to turn research into instructions.

The third is Award Signal, a procedural WebGL portfolio concept. This was also produced from the later version of the workflow.

The part that matters: Ultimate Portfolio and Award Signal were generated from single short prompts. Not a long prompt dump. Not a multi-day handholding session. One short prompt, under four sentences.

That does not mean the model magically became a designer. It means the prompt was sitting on top of a better system.

Terminal
example             role                         workflow
Imprint             agency website concept       research + ruleset + iterative Codex development
Ultimate Portfolio  portfolio landing concept    one short Codex prompt on top of the refined system
Award Signal        WebGL portfolio concept      one short Codex prompt on top of the refined system

That distinction is the entire point of the article.

AI website generation gets better when the model is not asked to carry the whole process in one prompt. The model needs a working environment, a design vocabulary, a repeatable set of standards, and enough structure that it can execute instead of improvise.

The First Version: Imprint

Imprint was the first useful version of the idea.

The goal was not to ask Codex to "make a premium agency website." That kind of instruction is too vague. It gives the agent permission to reach for every common visual trope in the AI-generated website drawer.

Instead, I started with research.

I looked at award-winning websites in the agency and digital-product space. I paid attention to layout, pacing, type, media, motion, information architecture, proof, and how the sites created brand confidence before the user had to read too much. I was not trying to copy any one site. I was trying to understand the design language that made a site feel intentional.

Then I turned that into a small ruleset.

The ruleset answered questions like:

  • What should the first viewport prove?
  • How much visual density is acceptable?
  • What kind of motion should be allowed?
  • What kind of motion should be avoided?
  • How should service clarity show up?
  • How should proof be introduced?
  • What should mobile preserve from desktop?
  • Which sections are reusable?
  • Which sections are too specific to hard-code?
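One way to make a ruleset like this inspectable by an agent is to store the answers as data instead of prose. The sketch below is hypothetical: the class, field names, and values are illustrative assumptions, not the actual Imprint ruleset.

```python
from dataclasses import dataclass, field


@dataclass
class DesignRuleset:
    """Hypothetical container for answers to the ruleset questions."""
    first_viewport_goal: str
    max_sections_per_page: int
    allowed_motion: list[str] = field(default_factory=list)
    banned_motion: list[str] = field(default_factory=list)

    def violations(self, planned_motion: list[str], planned_sections: list[str]) -> list[str]:
        """Return human-readable problems with a planned page."""
        problems = []
        for effect in planned_motion:
            if effect in self.banned_motion:
                problems.append(f"banned motion: {effect}")
            elif effect not in self.allowed_motion:
                # Anything not explicitly allowed gets flagged for review.
                problems.append(f"unreviewed motion: {effect}")
        if len(planned_sections) > self.max_sections_per_page:
            problems.append(
                f"too dense: {len(planned_sections)} sections, limit {self.max_sections_per_page}"
            )
        return problems


# Illustrative values only.
ruleset = DesignRuleset(
    first_viewport_goal="state the service and show one piece of proof",
    max_sections_per_page=6,
    allowed_motion=["fade-in", "scroll-reveal"],
    banned_motion=["parallax-everything", "cursor-trail"],
)

report = ruleset.violations(
    planned_motion=["fade-in", "cursor-trail", "tilt-card"],
    planned_sections=["hero", "services", "case-studies", "contact"],
)
print(report)
```

The useful property is that a vague preference ("tasteful motion") becomes something a script or an agent can check against a plan before any code is written.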

That is the step most people skip.

They go from inspiration directly to prompting. The result is usually imitation. The better move is to translate inspiration into constraints.

For Imprint, the planning work happened in a higher-reasoning Codex run. The development work then moved through more focused medium- and lower-reasoning passes. That split matters. In my experience, higher-reasoning runs are useful when the task is ambiguous and the system needs to be shaped. Lower- and medium-reasoning runs can be better once the decisions are already made and the agent needs to implement concrete changes.

That is not a universal law. It is an observation from my workflow. But it has held up enough that I now think about agent work in phases:

Terminal
research and system design -> higher reasoning
component implementation   -> focused medium reasoning
small edits and cleanup    -> lower reasoning

The point is not that one setting is better. The point is that the agent should not be doing every kind of thinking at the same time.
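As a rough sketch, the phase split can be written down as a lookup so it is applied consistently instead of decided ad hoc. The phase names and the "high", "medium", and "low" labels are placeholders I am assuming for illustration; they are not Codex configuration values.

```python
# Hypothetical mapping from task phase to reasoning level.
PHASE_REASONING = {
    "research": "high",
    "system_design": "high",
    "component_implementation": "medium",
    "small_edit": "low",
    "cleanup": "low",
}


def reasoning_for(phase: str) -> str:
    """Pick a reasoning level; unknown phases default to high because they are ambiguous."""
    return PHASE_REASONING.get(phase, "high")


tasks = [
    ("define the design language", "system_design"),
    ("build the hero component", "component_implementation"),
    ("fix footer spacing", "small_edit"),
    ("explore a new niche", "unclassified"),
]

schedule = [(task, reasoning_for(phase)) for task, phase in tasks]
for task, level in schedule:
    print(f"{level:>6}  {task}")
```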

The Better Version: One-Shot Outputs

After Imprint, I came back to the idea with a larger dataset and a clearer workflow.

That is where Ultimate Portfolio and Award Signal came from.

I had a better understanding of what the agent needed before it started writing code. I had a clearer design language. I had a better sense of the difference between an instruction that sounds nice and an instruction that actually constrains output. And I had a stronger belief that boilerplates are not just convenience. They are context infrastructure.

The result was that Codex could generate stronger first passes from much shorter prompts.

Both sites came from a single prompt of under four sentences. That looks impressive from the outside, but it is also the part people can misunderstand.

The short prompt worked because the hard work had already been moved somewhere else.

It was not "one prompt made a good website." It was "one prompt activated a prepared system."

That is the difference between prompt engineering and context engineering.

Prompt engineering is about what you type. Context engineering is about everything the model can use when it decides what to do. Anthropic describes context as the tokens available to the model at sampling time, and frames context engineering as the work of optimizing those tokens so the model is more likely to produce the desired behavior. That is a useful term because it moves the conversation away from magic wording and toward system design.

For websites, context engineering means the agent should not start from an empty folder and an aesthetic adjective. It should start with:

  • a stack that already works
  • components that match the intended design language
  • examples of good and bad patterns
  • rules for layout and motion
  • documentation for what the site is trying to communicate
  • a content structure that search engines and AI systems can parse
  • a clear definition of what the agent should not do
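The starting conditions above can be checked mechanically before an agent session begins. This is a minimal sketch under an assumed file layout: every path in `REQUIRED_CONTEXT` is a hypothetical name chosen for illustration, not the layout of the boilerplates described in this post.

```python
import tempfile
from pathlib import Path

# Hypothetical paths mapped to the reason each one should exist.
REQUIRED_CONTEXT = {
    "package.json": "a stack that already works",
    "components": "components that match the design language",
    "docs/patterns.md": "examples of good and bad patterns",
    "docs/layout-and-motion.md": "rules for layout and motion",
    "docs/messaging.md": "what the site is trying to communicate",
    "docs/content-structure.md": "a parseable content structure",
    "docs/do-not.md": "what the agent should not do",
}


def missing_context(project_root: Path) -> dict[str, str]:
    """Return the required paths that do not exist yet, with the reason each matters."""
    return {
        rel: why
        for rel, why in REQUIRED_CONTEXT.items()
        if not (project_root / rel).exists()
    }


# Demo against a throwaway project that only has part of the foundation.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "components").mkdir()
    (root / "docs").mkdir()
    (root / "package.json").write_text("{}", encoding="utf-8")
    (root / "docs" / "patterns.md").write_text("# Patterns\n", encoding="utf-8")
    gaps = missing_context(root)

for path, reason in gaps.items():
    print(f"missing {path}: {reason}")
```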

That is why the same model can produce generic output in one environment and much better output in another.

The harness matters.

Why Boilerplates Matter For AI Agents

A normal developer boilerplate saves setup time.

An AI-agent boilerplate does more than that.

It reduces the amount of context the agent has to reconstruct. It gives the model a working foundation immediately. It makes repeated decisions less expensive. It preserves conventions across sessions. It also lowers the chance that the agent spends half of its useful attention on setup, package choices, file organization, design primitives, and baseline layout decisions.

That is why I started thinking about the boilerplates as agent infrastructure.

The goal is not just "start faster." The goal is "start with fewer ways to drift."

When an agent starts from a strong foundation, the task can be narrower:

Terminal
bad task shape:
build me a premium WebGL portfolio from scratch

better task shape:
use this portfolio system, preserve the design language, add this content,
and adapt the interaction model for this specific person

Those are completely different tasks.

The first task asks the agent to invent product strategy, design language, implementation architecture, motion rules, responsive behavior, and content structure at the same time.

The second task asks it to adapt a system.

That difference matters because LLMs do not use long context perfectly. The paper "Lost in the Middle" found that models tend to do best when the relevant information sits at the beginning or end of the input and noticeably worse when it sits in the middle, even when the model technically supports long inputs. In practical agent work, I see a related issue: long sessions accumulate research, tool output, partial edits, corrections, summaries, and stale assumptions. Eventually the agent may still have a lot of context, but the useful design intent is no longer easy to preserve.

I used to describe that informally as "token fatigue." The better term is long-context degradation. More specifically, for agent sessions, I think of it as context drift and lossy context compaction.

The model may not be out of tokens. The problem is that the important tokens are competing with too much accumulated noise.
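To make the idea concrete, here is a small sketch of one way to protect important tokens: pin the design intent so it is never dropped, then fill the remaining budget with the newest items. This is an illustration only, not how Codex compacts context, and the one-token-per-word estimate is a deliberately crude assumption.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    text: str
    pinned: bool = False  # design intent and rules: never dropped


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def compact(history: list[ContextItem], budget: int) -> list[ContextItem]:
    """Keep every pinned item, then walk newest to oldest and keep whatever still fits."""
    remaining = budget - sum(estimate_tokens(i.text) for i in history if i.pinned)
    keep = set()
    for idx in range(len(history) - 1, -1, -1):
        item = history[idx]
        if item.pinned:
            keep.add(idx)
            continue
        cost = estimate_tokens(item.text)
        if cost <= remaining:
            keep.add(idx)
            remaining -= cost
    # Preserve the original ordering of the surviving items.
    return [item for idx, item in enumerate(history) if idx in keep]


history = [
    ContextItem("Design intent: calm editorial layout, no dashboard sections", pinned=True),
    ContextItem("old tool output " * 10),
    ContextItem("Correction: hero uses one typeface"),
    ContextItem("Latest diff summary for the hero component"),
]

kept = compact(history, budget=25)
for item in kept:
    print(item.text[:60])
```

In this run the bulky old tool output is dropped while the pinned intent and the two most recent notes survive.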

Boilerplates help because they move repeated context into code, structure, and documentation. Instead of re-explaining the same design system in every session, the agent can inspect it.

That is a much better place for context to live.

Agent Skills Turn Workflow Into A Reusable System

The next layer is agent skills.

Codex supports agent skills: task-specific packages of instructions, resources, and optional scripts that Codex can load when a workflow applies. The open Agent Skills format describes a skill as a folder with a SKILL.md file, plus optional scripts, references, templates, and other resources.

That matters because a repeatable AI workflow should not depend on me remembering the perfect instruction every time.

For this design workflow, the skill concept is a natural fit:

  • one skill for award-site design research
  • one skill for extracting reusable design patterns
  • one skill for translating research into frontend constraints
  • one skill for building a website boilerplate from those constraints
  • one skill for verification across desktop, mobile, performance, and content structure
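As a sketch, any one of these skills could be scaffolded as a folder containing a SKILL.md file plus optional resource folders. The `name` and `description` frontmatter keys reflect my understanding of the Agent Skills format and should be checked against the current spec; the skill name, description, steps, and subfolder names below are hypothetical.

```python
import tempfile
from pathlib import Path


def scaffold_skill(root: Path, name: str, description: str, steps: list[str]) -> Path:
    """Create <root>/<name>/ with a SKILL.md and empty resource folders."""
    skill_dir = root / name
    (skill_dir / "references").mkdir(parents=True, exist_ok=True)
    (skill_dir / "scripts").mkdir(exist_ok=True)
    body = "\n".join(f"{n}. {step}" for n, step in enumerate(steps, start=1))
    skill_md = skill_dir / "SKILL.md"
    # The description is kept free of colons so this naive frontmatter stays valid.
    skill_md.write_text(
        f"---\nname: {name}\ndescription: {description}\n---\n\n# {name}\n\n{body}\n",
        encoding="utf-8",
    )
    return skill_md


with tempfile.TemporaryDirectory() as tmp:
    skill_md = scaffold_skill(
        Path(tmp),
        name="design-research",
        description="Collect and annotate award-site references for one niche",
        steps=[
            "Gather candidate sites for the niche.",
            "Record layout, type, motion, and proof patterns.",
            "Write the findings into references/.",
        ],
    )
    content = skill_md.read_text(encoding="utf-8")
    created = sorted(p.name for p in skill_md.parent.iterdir())

print(content)
print(created)
```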

This is where the work becomes more than a prompt.

A prompt is easy to lose. A skill can be versioned. A prompt is hard to audit. A skill can include references and scripts. A prompt tends to get longer over time. A skill can use progressive disclosure, loading the detailed instructions only when the workflow requires them.

That last part is important. The Codex documentation describes skills as using progressive disclosure to manage context efficiently: the agent starts with only the skill names and descriptions, then loads the full instructions when it chooses a skill. That is exactly the kind of pattern I want for complex website work.

The system should not dump everything into the model all the time. It should make the right context available at the right moment.
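The progressive disclosure pattern can be sketched in a few lines: index only the names and descriptions, and read the full instructions for a skill only after it has been chosen. This illustrates the pattern and is not Codex's loader. The frontmatter parser is a naive stand-in for a real YAML parser, and the two skills are made up for the demo.

```python
import tempfile
from pathlib import Path


def read_frontmatter(skill_md: Path) -> dict[str, str]:
    """Parse simple `key: value` frontmatter between the first two `---` lines."""
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    meta = {}
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta


def build_index(skills_root: Path) -> dict[str, str]:
    """Cheap step: only names and descriptions enter the context."""
    index = {}
    for skill_md in sorted(skills_root.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_md)
        index[meta["name"]] = meta["description"]
    return index


def load_skill(skills_root: Path, name: str) -> str:
    """Expensive step: read the full instructions only for the chosen skill."""
    return (skills_root / name / "SKILL.md").read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, desc in [
        ("design-research", "Collect award-site references"),
        ("visual-qa", "Check desktop and mobile screenshots"),
    ]:
        (root / name).mkdir()
        (root / name / "SKILL.md").write_text(
            f"---\nname: {name}\ndescription: {desc}\n---\n\nFull instructions for {name}.\n",
            encoding="utf-8",
        )
    index = build_index(root)
    chosen = load_skill(root, "visual-qa")

print(index)
print(chosen)
```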

The WebGL Problem

Award Signal is the most technically interesting example and also the easiest one to critique.

The procedural WebGL work is impressive for a one-shot output. Codex programmed the models and scene behavior itself. But the WebGL models are also the weakest part of the site.

That is not surprising.

Procedural 3D is a much harder design target than layout, typography, or motion timing. A generated 3D object can be technically valid and still look wrong. It can render, animate, and respond to the cursor while still lacking the craft that makes 3D feel intentional.

My takeaway is not "agents are bad at WebGL." My takeaway is that WebGL needs its own research layer.

The next version should not ask the agent to invent every model from scratch. It should have:

  • a curated dataset of procedural 3D references
  • rules for when WebGL should be used
  • standards for model quality
  • a library of prefabricated open-source models where appropriate
  • an SOP for adapting 3D assets without hurting performance
  • mobile fallbacks and reduced-motion behavior
  • verification that the 3D element serves the page instead of distracting from it
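Some of these rules can be expressed as a simple decision function. The sketch below covers only fallbacks and a performance budget. The triangle and file-size limits are invented numbers for illustration, and in a real site this logic would run in the browser rather than in Python.

```python
from dataclasses import dataclass


@dataclass
class Device:
    supports_webgl: bool
    prefers_reduced_motion: bool
    is_mobile: bool


@dataclass
class Asset3D:
    name: str
    triangles: int
    size_kb: int


# Hypothetical budgets: (max triangles, max file size in KB).
BUDGETS = {"desktop": (150_000, 2_000), "mobile": (40_000, 600)}


def render_mode(device: Device, asset: Asset3D) -> str:
    """Decide whether to render the 3D asset or fall back to a static image."""
    if not device.supports_webgl or device.prefers_reduced_motion:
        return "static-image"
    max_triangles, max_kb = BUDGETS["mobile" if device.is_mobile else "desktop"]
    if asset.triangles > max_triangles or asset.size_kb > max_kb:
        return "static-image"
    return "webgl"


hero = Asset3D(name="hero-orb", triangles=90_000, size_kb=900)
desktop = Device(supports_webgl=True, prefers_reduced_motion=False, is_mobile=False)
phone = Device(supports_webgl=True, prefers_reduced_motion=False, is_mobile=True)
calm = Device(supports_webgl=True, prefers_reduced_motion=True, is_mobile=False)

modes = {
    "desktop": render_mode(desktop, hero),
    "phone": render_mode(phone, hero),
    "reduced-motion": render_mode(calm, hero),
}
print(modes)
```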

That would turn WebGL from a novelty layer into a design system layer.

The broader lesson is that every difficult domain needs its own context. WebGL has different failure modes than typography. Business automation has different failure modes than a landing page. A CRM workflow has different failure modes than a portfolio site.

AI agents get better when we stop pretending all tasks need the same kind of prompt.

Why This Matters For Business Automation

This started with websites, but the real business value is bigger.

The same pattern applies to business automation.

Most companies do not need a random AI demo. They need repeatable systems around messy workflows: lead intake, quoting, onboarding, reporting, CRM cleanup, content operations, customer support, internal dashboards, document processing, and follow-up sequences.

The mistake is the same one people make with websites. They ask the model to perform the whole workflow from a vague instruction.

That can work once. It rarely becomes reliable business infrastructure.

The better approach is to build the system around the agent:

  • define the workflow
  • document the edge cases
  • create reusable templates
  • connect the right tools
  • define what good output looks like
  • test the workflow repeatedly
  • package the instructions so the agent can reuse them
  • keep humans in the loop where judgment matters
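A small lead-intake example shows what several of these steps look like in code: explicit edge cases, a definition of acceptable input, and a human-review gate. Everything here is hypothetical, including the field names, the threshold, and the routing labels.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold above which a person should look at the lead.
HUMAN_REVIEW_BUDGET = 20_000.0


@dataclass
class Lead:
    name: str
    email: str
    budget: Optional[float]


def route(lead: Lead) -> str:
    """Return the next step for a lead, with documented edge cases handled first."""
    if "@" not in lead.email:
        return "reject: invalid email"
    if lead.budget is None:
        return "human_review: missing budget"
    if lead.budget >= HUMAN_REVIEW_BUDGET:
        return "human_review: high value"
    return "auto_reply"


leads = [
    Lead("Ada", "ada@example.com", 5_000.0),
    Lead("Ben", "ben-at-example.com", 5_000.0),
    Lead("Cy", "cy@example.com", None),
    Lead("Di", "di@example.com", 50_000.0),
]

decisions = [route(lead) for lead in leads]
for lead, decision in zip(leads, decisions):
    print(f"{lead.name}: {decision}")
```

Because the routing rules are plain code, they can be tested repeatedly, which is much harder when the same judgment lives inside a prompt.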

That is why a website boilerplate and a business automation system are more related than they look.

Both are attempts to reduce ambiguity before the model acts.

In a website, ambiguity creates generic UI. In business automation, ambiguity creates broken handoffs, bad data, inconsistent output, and silent operational errors.

The solution is context engineering.

SEO And AI Discovery Are Part Of The Product

There is also a search reason to care about this.

Google's guidance around generative AI features in Search still points back to fundamentals: valuable, unique, non-commodity content remains important. That matters because an AI-generated website that looks good but says nothing specific is not doing the full job.

If a site is meant to build trust, it needs more than visual polish. It needs clear entities, specific claims, crawlable structure, useful headings, internal links, metadata, and content that a human or AI system can summarize accurately.
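The post does not say which metadata format the sites use. One common choice is JSON-LD with schema.org vocabulary, and the sketch below generates a minimal BlogPosting record under that assumption. The author name is a placeholder, and the output would normally be embedded in the page in a script tag of type application/ld+json.

```python
import json


def blog_posting_jsonld(headline: str, date_published: str, author_name: str, keywords: list[str]) -> dict:
    """Build a minimal schema.org BlogPosting description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": headline,
        "datePublished": date_published,  # ISO 8601 date
        "author": {"@type": "Person", "name": author_name},
        "keywords": keywords,
    }


data = blog_posting_jsonld(
    headline="How I Used Codex To Build Better AI-Generated Websites",
    date_published="2026-06-30",
    author_name="Author Name",  # placeholder: the real name is not given here
    keywords=["AI Engineering", "Frontend", "Business Automation"],
)

payload = json.dumps(data, indent=2)
print(payload)
```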

That is another reason I care about the boilerplate layer.

The best version of an AI website workflow should not only produce prettier pages. It should produce pages that explain the business clearly, support search, support AI answer engines, and create a path for conversion.

For my own site, the goal is not traffic for traffic's sake. The goal is to make the work discoverable by the right people:

  • founders trying to use AI without building a toy
  • businesses that need automation but do not know where to start
  • teams that want better websites, internal tools, or content systems
  • people evaluating whether I can turn AI workflows into production systems

That is why the article itself has to be specific. Generic AI content is commodity content. A real case study is not.

What I Would Do Differently Next Time

If I were starting the workflow again, I would formalize it earlier.

The first version worked because I had enough taste and enough context to push Codex in the right direction. But the repeatable value comes from turning that taste into artifacts the agent can use without me reconstructing the whole theory in every session.

The next version would have more structure:

  1. A research corpus organized by niche, visual pattern, interaction pattern, and business goal.
  2. A design-pattern library that separates what to learn from what not to copy.
  3. Boilerplates for specific site categories: agency, portfolio, product, local service, SaaS, and WebGL showcase.
  4. Agent skills for research, synthesis, implementation, QA, SEO, and publishing.
  5. A screenshot and visual-regression workflow for desktop and mobile.
  6. A WebGL asset SOP with rules for prefabricated models, procedural models, and fallbacks.
  7. A content framework that makes every generated site easier for search engines and AI systems to understand.
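For item 5, the core of a visual-regression check is small: compare a new screenshot against an approved baseline and fail if too many pixels changed. The sketch below uses nested lists of grayscale values in place of real screenshots, and the tolerance and threshold are arbitrary assumptions. A real workflow would capture images with a browser automation tool and use an image library.

```python
Grid = list[list[int]]  # rows of grayscale values, 0 to 255


def changed_fraction(baseline: Grid, candidate: Grid, tolerance: int = 8) -> float:
    """Fraction of pixels whose value differs by more than `tolerance`."""
    same_shape = len(baseline) == len(candidate) and all(
        len(a) == len(b) for a, b in zip(baseline, candidate)
    )
    if not same_shape:
        raise ValueError("screenshots must have the same dimensions")
    total = sum(len(row) for row in baseline)
    changed = sum(
        1
        for row_a, row_b in zip(baseline, candidate)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > tolerance
    )
    return changed / total


def passes(baseline: Grid, candidate: Grid, max_changed: float = 0.01) -> bool:
    """True when the candidate stays within the allowed amount of visual change."""
    return changed_fraction(baseline, candidate) <= max_changed


baseline = [[0, 0, 255, 255], [0, 0, 255, 255]]
candidate = [[0, 3, 255, 255], [0, 0, 255, 120]]  # one small and one large change

fraction = changed_fraction(baseline, candidate)
ok = passes(baseline, candidate)
print(fraction, ok)
```

Running the same comparison on a desktop and a mobile capture gives a cheap signal that an agent edit changed the layout more than intended.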

That is the version that becomes a product.

Not a single website. A system for producing better websites and automations with less waste.

What This Proves And What It Does Not

This does not prove that AI can replace design.

It does not prove that a one-shot prompt is enough.

It does not prove that every business should let an agent generate its entire website without review.

What it does prove, at least to me, is that Codex can produce meaningfully better frontend work when the agent is given the right operating system around the task.

The formula is not:

Terminal
better prompt -> better website

The formula is closer to:

Terminal
research -> context engineering -> agent skills -> boilerplate -> focused Codex pass -> verification

That is a more useful mental model.

It explains why two people using the same model can get very different results: one generic, one directed. It also explains why the same agent can perform better after the setup work has been moved into reusable context.

The model matters. The harness matters more than most people think.

The Practical Takeaway

If you are trying to use Codex or another AI agent to build better websites, do not start by asking for a beautiful website.

Start by building the conditions for a better website to be generated.

Curate references. Extract patterns. Write rules. Build a foundation. Reduce repeated decisions. Use skills or SOPs. Keep high-level planning separate from implementation. Verify the output on real screens. Treat SEO and AI discovery as part of the product, not something you sprinkle on after the page looks good.

That is how AI-generated UI gets better.

And if you are a business owner, the same lesson applies beyond websites.

The value of AI is not the demo. It is the repeatable workflow. If you want AI to help with your website, content, operations, internal tools, or customer workflows, the hard part is not typing the prompt. The hard part is designing the system around the agent so the output is consistent, useful, and connected to the business.

That is the work I am interested in.

If your business needs a better website, an AI-ready content system, or automation that turns messy daily workflows into something repeatable, contact me. I can help build the system around the agent, not just the prompt in front of it.

Sources And Further Reading

  • Imprint agency website concept
  • Ultimate Portfolio landing page concept
  • Award Signal procedural WebGL portfolio concept
  • OpenAI Codex: Agent Skills
  • Agent Skills overview
  • Anthropic: Effective context engineering for AI agents
  • Lost in the Middle: How Language Models Use Long Contexts
  • Google Search Central: Optimizing for generative AI in Search
