For a brief moment, “prompt engineering” looked like a parlour trick—clever one‑liners pasted into chat boxes to squeeze better answers out of large language models. As LLMs have moved into fraud detection, marketing automation, legal workflows and children’s products, that view has quickly dated. Prompts now sit on the critical path of real systems. They govern how models behave, what they are allowed to say, and how reliably they perform under pressure.
Ellis Crosby is one of the practitioners pushing this discipline from craft to system. Coming from software and data engineering, he has never treated prompts as disposable text. For him, a prompt is a function embedded in a production system—something that needs clear ownership, versioning, testing and monitoring, just like any other interface that touches users. That mindset has been shaped by projects ranging from Scarlett Panda, a children’s reading app where prompts effectively formed part of the safety boundary, to enterprise‑grade chatbots and agents evaluated through golden conversations, edge‑case suites and long‑horizon simulations.
In this interview, Crosby, Founder and Prompt Specialist at Spring Prompt, argues that “industrial‑grade” prompt engineering starts long before anyone edits a system message. It starts with defining desired behaviour, encoding domain expertise as tests and rubrics, and building the organisational muscle to iterate safely. The goal is not to worship prompts as magic strings, but to fall in love with outcomes—and build systems that can deliver them as models, markets and risks change.
Which moments in your consulting or product work first convinced you that prompts deserved the same rigour as code, models, or data pipelines?
I came into AI from a data engineering and software background, so I think I saw prompts slightly differently from the start. To me, a prompt was never just a piece of text. It was a function: it took some input, transformed it, and produced some output. If that function was going into a real product, it needed ownership, versioning, testing, and some way of knowing whether it was still doing the thing we expected it to do.
I started working with language models around 2020, when they were much more expensive to run and the tooling around them was far less mature. At the time, most of my testing was fairly basic: spot checks, safety checks, and making sure we were not getting obviously dangerous or unsuitable outputs. It was not what I would now call a serious eval suite, but the instinct was already there. The prompt was part of the product, so it needed to be treated like part of the system.
That became especially clear with Scarlett Panda, the children’s reading app I built. In that context, the prompt was not just responsible for generating a nice story. It was part of the safety boundary of the product. Early on, we underestimated how creatively people might try to misuse a children’s story app. Basic moderation was not enough, because people would paraphrase, disguise intent, or route around simple keyword checks. That was one of the moments where it became obvious that prompting was not just “writing better instructions”. It was production behaviour. It needed adversarial testing, regression tests, edge cases, versioning, and clear ownership, just like any other safety‑critical part of the system.
ALSO READ: Agentic AI in Production: Why Better Prompts Won’t Bridge the Gap
Walk us through a concrete before‑and‑after: a workflow or prompt that “looked great” in a playground, and what changed once you put it through a serious eval suite.
The most honest before‑and‑after for me is not one project where we had no evals and then magically added them later. It is the contrast between earlier projects where prompt quality was reviewed manually, and later projects where evals were built in from the start because we had learnt the cost of not doing that.
On one sales‑personalisation project, the prompts were extremely long and context‑heavy. They pulled in seller context, buyer context, research questions, source material, positioning, tone, and more. In some cases, the prompts were thousands of lines long. The outputs could look excellent: deeply personalised and much better than a generic template. But the process around them was fragile. Every prompt change had to be reviewed manually by one senior stakeholder, which meant a small change could take one or two days to approve. It also made evaluation subjective, and because review was slow, we kept testing against the same small set of examples. That creates a real risk of overfitting: you optimise for five examples, or one person’s taste, rather than the real distribution of users and situations.
On more recent chatbot and agent projects, including the thinking behind Spring Prompt, we took the opposite approach. Before touching the prompt, we defined the behaviour we wanted, created golden conversations, edge cases, and scoring rubrics, and built evals around them. Some of those conversations were designed to span many turns, because a chatbot can look fine in one response but drift after eight, fifteen, or thirty messages. The biggest change was speed. Instead of waiting days for a subjective opinion, we could make a change, run an eval, compare variations or models, and see the impact quickly. It turns prompt engineering from taste‑based editing into an engineering loop.
The quality improves, but the bigger shift is the pace of learning. If you find a new failure, you do not just patch the prompt for that one case. You add the failure to the eval suite, so the system becomes harder to break over time. That is a completely different way of working. It is much closer to software development: write the failing test, make the change, run the suite, and keep moving.
The key principle is: do not fall in love with the prompt. Fall in love with the outcome, and build a system that can keep producing it as models and requirements change.
If we treated prompts as first‑class assets like models or microservices, what would a healthy lifecycle look like, from first draft to “production‑ready”?
The first thing I would say is that the prompt itself is not really the thing you are building. You are building an outcome. The prompt is one implementation of that outcome. Prompts change, models change, and the best prompting style for one model may not be the best style for another. But the desired behaviour of the product should be much more stable.
So a healthy lifecycle starts before anyone writes the prompt. First, document what the system should actually do: the tone, actions, boundaries, information sources, refusal behaviour, reasoning style, and examples of good and bad outputs. From there, create test cases, golden conversations, scoring rubrics, and edge cases. For chatbots, I especially like golden conversations because some problems only appear over multiple turns: tone drift, overconfidence, excessive agreeableness, verbosity, or gradually moving outside the intended scope.
ALSO READ: The AI Infra Deals That Defined Q1 2026
Only once you have that measurement layer does it really make sense to optimise the prompt. That is a big part of why I started building Spring Prompt: I wanted a more practical way to connect prompt work to evals, model comparisons, and measurable behaviour, instead of relying on subjective review. Once you have a baseline, you can test variations, compare models, and understand the trade‑offs between quality, cost, and latency. Sometimes a larger model is better, but sometimes a smaller model is more controllable for tone or perfectly good for a narrower task. Without evals, a lot of those decisions are just vibes.
From there, the lifecycle should look much more like normal production engineering: versioning, changelogs, regression tests, deployment gates, monitoring, and rollback. When the desired behaviour changes, you update the documentation, add or adjust eval cases, establish a new baseline, and then update the prompt or model to meet it. The key principle is: do not fall in love with the prompt. Fall in love with the outcome, and build a system that can keep producing it as models and requirements change.
Prompting a chatbot casually in ChatGPT is not the same as designing production behaviour at scale.
In a high‑performing LLM team, how do you see responsibilities split between prompt engineers, ML engineers, data people, and PMs or domain experts?
Right now, this is still quite fluid in most companies. Some teams have AI engineers. Some have domain experts writing prompts directly. Some have engineers doing it. Sometimes it is a PM, sometimes it is a founder, and sometimes it is a C‑level executive who has been using ChatGPT a lot and wants the product to behave the same way. I do not think the exact job titles matter as much as the separation of responsibilities.
The most important split is between defining the desired behaviour and implementing that behaviour. Domain experts and PMs should be heavily involved in defining what good looks like: the ideal outputs, edge cases, safety boundaries, tone, and business rules. But that does not necessarily mean they should be writing the production prompt. Their expertise is usually better captured in behavioural specs, golden examples, and eval criteria.
Then someone with LLM engineering experience can translate that into the working system: prompts, evals, model comparisons, routing logic, structured outputs, failure modes, and production integration. Data people are also crucial, because many “prompt problems” are really context problems. The model is missing information, receiving stale information, getting too much context, or receiving context in a shape it cannot use well. ML engineers or infrastructure people can help with model selection, latency, cost, fallbacks, and routing.
The strongest people I have seen in this space are often cross‑domain translators. They understand the business or domain problem, but they can also turn that expertise into systems, tests, and automation. Prompting a chatbot casually in ChatGPT is not the same as designing production behaviour at scale. In ChatGPT, a one‑off success can feel very convincing. In production, you need that behaviour to hold across many users, edge cases, contexts, and model changes. That translation layer is where a lot of the value is.
The goal is not to create prompt celebrities. It is to build the internal ability to define outcomes, measure behaviour, and iterate safely.
If a VP asks, “Do I need a dedicated prompt team or just upskill my existing engineers?”, how do you answer?
For most companies, I would say a dedicated prompt team is probably overkill. There are exceptions, especially for AI‑native products with many prompts, agents, tools, experiments, and user‑facing surfaces. But most companies do not need a permanent prompt department. They need a repeatable prompt engineering capability.
That means having a clear process for defining behaviour, creating evals, testing prompts, comparing models, versioning changes, and monitoring performance. A good set‑up is often a cross‑functional group: a domain expert or PM owns the desired behaviour, an engineer owns the implementation and evaluation system, and someone with AI experience helps bridge the two. If the engineer is already cross‑domain, they may be able to cover more of the process. If the domain is specialised, the domain expert matters much more.
The biggest mistake I see is companies assuming that because someone can get a good answer from ChatGPT, they can design a production LLM system. Those are very different things. In ChatGPT, you are having a one‑off interaction and can steer it as you go. In production, the model is handling many users, hidden context, structured outputs, safety boundaries, cost constraints, latency constraints, and failure modes.
ALSO READ: Inside the Orchestration Crisis: Why AI‑Driven Enterprises Need a Control Plane, Not More Tools
So I would usually recommend upskilling existing engineers and PMs, but with support from someone who has done this before, whether that is an experienced AI engineer, specialist contractor, or agency. A specialist can help set up the foundation: evals, goldens, model tests, deployment patterns, and the first production workflows. After that, the internal team can keep improving it. The goal is not to create prompt celebrities. It is to build the internal ability to define outcomes, measure behaviour, and iterate safely.
Build a small eval suite. It does not need to be perfect. Even twenty good test cases are better than relying on vibes.
What is the lowest‑effort, highest‑impact change you would suggest a company make this quarter to move towards industrialised prompt engineering?
The highest‑impact thing most companies can do is stop working on the prompt for a moment and define what they actually want the AI to do. That sounds simple, but it is skipped all the time. A company will say, “We need a chatbot for sales,” or “We want an accounting assistant,” but when you dig in, the desired behaviour is not fully defined.
So the first step is a short cross‑functional behaviour‑definition process. What should it do? What should it not do? When should it escalate? What does a great answer look like? What is unacceptable? What sources should it trust? What tone should it have? Write examples, bad examples, and edge cases, then turn that into a lightweight rubric. You can use AI to help generate and expand those cases, but the team still needs to own the judgement of what “good” means.
From there, build a small eval suite. It does not need to be perfect. Even twenty good test cases are better than relying on vibes. This is exactly the sort of workflow I want Spring Prompt to make easier: define the behaviour, test the prompt against it, compare outputs, and make iteration feel more like engineering than guesswork. Once you have that, every prompt change becomes easier to judge, every new model can be benchmarked, and every user issue can become a regression test.
That is something a company can start in one or two weeks, and the impact compounds over the next ninety days. Every new edge case strengthens the suite. Every model release can be tested against your actual desired behaviour. Every prompt change becomes measurable. That is the real shift: moving from “we think this prompt is better” to “we can measure whether the system is behaving more like the product we want to build.”
ALSO READ: Start With the Context Layer First: A Framework for Production-Ready AI Agents
