Introduction: A Unique Challenge for AI Development
Artificial intelligence is rapidly transforming how we work, making effective integration crucial for organizations, yet incorporating new AI tools and features poses a new set of challenges for product development. One specific challenge arises from the variability of user interactions with unreleased AI models. Traditionally, research has been tied to design, but with chatbots, copilots, and other AI-driven experiences, the AI system must exist before meaningful user interactions can be observed. This shift requires teams to develop new processes and methods to support product development in the age of AI.
This paper presents one such novel methodology that integrates real user interactions into GPT-driven simulations to foster an adaptive and user-driven approach to AI design and development. It also outlines considerations for utilizing this method within the product development process and the potential for AI agents to act as simulated users to create more robust AI products.
Background: Why UX Process Must Adapt
Unlike traditional UX research, which focuses on measuring user behaviors, UX research for AI requires an additional layer—ensuring the AI itself can accurately interpret and execute user queries. User experience hinges on both understanding what an AI can do and knowing how to prompt it effectively. Historically, research has either led or followed design, with development occurring afterward.
However, AI research presents a unique challenge: the AI system must exist before meaningful user interactions can be studied, making it difficult to anticipate user behavior. This requires a fundamental shift—AI model development can benefit by preceding research and design. Recent research from Google DeepMind (Morris, 2025) highlights similar challenges in human-computer interaction (HCI) for AI, emphasizing the difficulties of bridging the “gulf of execution” and “gulf of evaluation”—ensuring users can properly communicate their intent to AI and validate the AI’s responses. As AI and copilots become integral to business automation, enabling users to perform complex tasks with ease, understanding user interactions with these products before release is essential.
A Copilot Learning Experience
In 2024, Workato introduced Recipe Copilot, leveraging AI to assist users in writing and editing recipes. While adoption was high, a usability study of user interactions revealed that certain input styles yielded accurate outputs, whereas others led to errors, frustrating users.
Despite its strong promise for enhanced productivity and above-average performance, the Copilot experience needed greater stability and accuracy in converting user inputs into meaningful recipes. I sought to help the team explore how research could assist QA in improving Copilot reliability.
Through a qualitative study of user interactions with our Copilot, three distinct user input styles emerged (illustrated below):
- Directive Prompting – Users provided explicit, step-by-step instructions in a single input.
- Goal-Oriented Prompting – Users described the desired outcome without specifying implementation details.
- Exploratory Prompting – Users refined their requests iteratively based on AI responses.
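To make these styles concrete, the hypothetical prompts below paraphrase how each style tends to appear in practice; they are illustrative examples, not verbatim prompts from the study.

```python
# Hypothetical example prompts for each observed input style.
# These paraphrase typical phrasing; they are not quotes from study participants.
PROMPT_STYLE_EXAMPLES = {
    "directive": (
        "When a new row is added in Google Sheets, create a lead in Salesforce, "
        "then post a confirmation message to the #sales Slack channel."
    ),
    "goal_oriented": (
        "I want new sign-ups from our webform to end up in Salesforce automatically."
    ),
    "exploratory": [  # refined across multiple turns
        "What can you automate between Google Sheets and Salesforce?",
        "OK, start with new rows creating leads.",
        "Can you also notify my team when that happens?",
    ],
}
```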
Using these insights, I reverse-engineered a structured framework that QA and engineering could utilize to enhance AI-generated outputs based on real-world use cases. This methodology enables AI systems to adapt to user behavior, improving automation workflows and overall satisfaction.
Simulating Prompt Styles for Research and QA
Training an AI model requires robust QA testing that mirrors real-world user interactions—many of which may be incomplete, ambiguous, or exploratory. For instance, we often see users turn to AI for guidance when they lack full understanding of a task themselves.
To simulate this variability, I developed a methodology using ChatGPT to generate simulated conversations for development testing. The goal was to produce idealized recipe outputs for the top recipe use cases and to provide varied, user-driven conversations that lead to that idealized result.
Methodology Steps:
- Defining Top Recipe Types – Identifying key automation use cases in Snowflake, categorized by unique recipes, triggers/actions, and apps used.
- Training GPT on Prompting Styles – Generating simulated conversations for each identified use case from the Copilot research (above).
- Generating Expected Outputs – Creating three output types to guide QA:
  - Idealized Workato Recipe Outputs – Benchmarks for expected automation workflows.
  - Simulated User Conversations – Demonstrating how different input styles influence Copilot interactions.
  - Stage-wise Copilot Outputs – Analyzing AI behavior at each stage of user engagement.
These structured outputs provided QA with both user input patterns and expected AI responses, helping refine system performance and accuracy. For instance, with these outputs QA could test whether the AI's outputs matched the intended ideal path across top use cases without having to hypothesize how a user might act.
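As a minimal sketch of the conversation-generation step, the snippet below asks a GPT model to role-play a user in each prompting style for a single use case, using the OpenAI Python SDK's chat completions interface. The system prompt wording, the example use case, and the model name are assumptions for illustration, not the exact prompts used in this work.

```python
"""Minimal sketch: generate a simulated user conversation for one recipe
use case and one prompting style. The system prompt, use-case label, and
model choice are illustrative assumptions."""
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPTING_STYLES = ["directive", "goal-oriented", "exploratory"]

def simulate_conversation(use_case: str, style: str) -> str:
    """Ask GPT to role-play a user with the given prompting style and return
    a transcript that QA can compare against the idealized recipe output."""
    system_prompt = (
        "You simulate a Workato user talking to Recipe Copilot. "
        f"Write the user's side of a conversation in a {style} style "
        f"for this automation use case: {use_case}. "
        "End with the idealized recipe (trigger, actions, apps) the "
        "conversation should produce."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model works here
        messages=[{"role": "system", "content": system_prompt}],
    )
    return response.choices[0].message.content

# Example: one transcript per style for a single (hypothetical) top use case.
transcripts = {
    style: simulate_conversation(
        "New Salesforce lead -> create NetSuite customer", style
    )
    for style in PROMPTING_STYLES
}
```

Generating one transcript per style for each top use case gives QA a matrix of inputs and expected outputs to test against, rather than a single happy-path script.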
The Expanding Role of Research in AI Development
Traditional UX research follows an iterative cycle between research and design before final product development. However, with AI, evaluating performance only after launch is too late. Instead, reorienting AI development toward an earlier Proof-of-Concept (PoC) phase enables iterative learning from real user interactions.
The Simulated Engagement Method:
- AI PoC Development – Engineers develop an interactive prototype based on an initial framework.
- Observing User Interactions – Users engage with the PoC, prompting the AI based on real-world needs.
- Identifying Interaction Styles – Categorizing user prompting/engagement strategies.
- Use-Case Analysis – Aggregating behavioral data to identify the most common automation use cases.
- Simulated Interactions – GPT-generated outputs model idealized AI responses across different prompting styles.
This methodology allows research to influence AI development earlier in the process, ensuring user-driven refinements rather than retrospective design modifications. By restructuring research and QA workflows, product teams can shorten feedback loops and accelerate development timelines while maintaining AI quality.
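To illustrate steps 3 and 4 of the method, the toy sketch below tags observed PoC prompts with an interaction style and aggregates them into use-case counts. The keyword heuristics and observation log are invented for illustration; in practice, this categorization was done qualitatively.

```python
"""Toy sketch of interaction-style identification and use-case analysis.
The keyword heuristics and observation log are assumptions for illustration."""
from collections import Counter

def classify_style(prompt: str, turn_count: int) -> str:
    """Assign a rough style label to one observed prompt."""
    text = prompt.lower()
    if turn_count > 1:
        return "exploratory"    # refined iteratively across multiple turns
    if any(k in text for k in ("step", "then", "first", "after that")):
        return "directive"      # explicit, sequenced instructions
    return "goal_oriented"      # outcome stated without implementation details

# Hypothetical observation log: (prompt text, follow-up turns, use case).
observations = [
    ("When a ticket closes, then update the CRM and then email the owner", 1, "ticket-to-crm"),
    ("I just want new leads to show up in our data warehouse", 1, "lead-sync"),
    ("What can you do with Slack?", 4, "slack-notify"),
]

style_counts = Counter(classify_style(p, turns) for p, turns, _ in observations)
use_case_counts = Counter(case for _, _, case in observations)
print(style_counts, use_case_counts)
```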
Future-Proofing AI with Agents
Observing real-world user interactions with a Proof-of-Concept (PoC) AI provides a foundation for understanding how users engage with the system. However, once interaction patterns are identified, AI agents can further contribute by assisting in quality assurance testing.
Rather than relying solely on predefined test cases, AI agents trained on observed user prompting styles could systematically probe AI systems, identifying weak points in responses and inconsistencies across different input patterns. This would allow for a continuous research-driven QA loop—where simulated agents replicate real-world interaction patterns to surface potential improvements before deployment.
By incorporating AI agents into QA workflows after research has established user input trends, companies can ensure AI models remain adaptable to evolving user behaviors. This additional layer of AI-assisted validation could help refine not only response accuracy but also interaction fluidity, ultimately improving AI copilots’ effectiveness in production environments.
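A research-driven agent QA loop of this kind might look like the sketch below, where a simulated user probes the copilot in each observed prompting style and flags divergences from an idealized recipe benchmark. The `query_copilot` stub and the benchmark structure are hypothetical placeholders, not an existing API.

```python
"""Sketch of an agent-driven QA loop: styled prompts probe the copilot under
test, and outputs are compared against idealized recipe benchmarks.
`query_copilot` and IDEAL_RECIPES are hypothetical placeholders."""

IDEAL_RECIPES = {
    "lead-sync": {"trigger": "new_salesforce_lead",
                  "actions": ["create_netsuite_customer"]},
}  # hypothetical benchmark, one entry per top use case

def query_copilot(prompt: str) -> dict:
    """Placeholder for the copilot under test; replace with a real call that
    returns the generated recipe parsed into a dict."""
    return {"trigger": "new_salesforce_lead", "actions": []}  # canned stub

def compare_to_benchmark(recipe: dict, ideal: dict) -> list[str]:
    """Return a list of divergences between generated and idealized recipes."""
    issues = []
    if recipe.get("trigger") != ideal["trigger"]:
        issues.append("trigger mismatch")
    missing = set(ideal["actions"]) - set(recipe.get("actions", []))
    if missing:
        issues.append(f"missing actions: {sorted(missing)}")
    return issues

def probe(use_case: str, styled_prompts: dict[str, str]) -> dict[str, list[str]]:
    """Run one use case through every prompting style and collect issues."""
    ideal = IDEAL_RECIPES[use_case]
    return {
        style: compare_to_benchmark(query_copilot(prompt), ideal)
        for style, prompt in styled_prompts.items()
    }
```

Running such probes continuously against each build would surface regressions in specific prompting styles before users encounter them.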
Future work will explore how AI agents can autonomously identify edge cases and prompt refinements based on large-scale user behavior patterns, further extending the impact of research-driven AI development.
References
- Morris, M.R. (2025). HCI for AGI. Interactions Magazine.