Market research is a foundational activity in marketing because it generates the data and insight that inform everything from strategic positioning to day‑to‑day campaign decisions. In that sense, any major shift in how we gather or interpret data has an outsized impact on how we do marketing at all.
However, access to robust market research—whether through quantitative surveys or qualitative interviews—has long been constrained by cost, time, and specialist skills. With the rise of generative AI tools, this is about to change, particularly in terms of efficiency and accessibility for smaller organisations such as SMBs and startups.
In their Harvard Business Review article, Korst, Puntoni, and Toubia identify four main ways AI can transform market research:
- Making existing activities faster, cheaper, or easier to scale
- Replacing some traditional research with synthetic data
- Filling gaps in market understanding
- Creating entirely new types of data and insight
The rest of this article focuses mainly on the second and third areas, where synthetic data and digital personas become particularly relevant for SMBs.
In my conversations with clients and students, the discussion usually focuses on efficiency gains: faster transcription, automated coding and analysis, or improved reporting. As Stefano Puntoni (Wharton) puts it, this is about “doing things better”. The more transformative opportunity, however, lies in “doing better things”: using AI not just to optimise existing workflows, but to open up entirely new ways of generating insight. Synthetic data and digital personas sit squarely in this second category.
In this article, I focus on synthetic data—especially digital personas—and how SMBs and startups can use AI‑generated market research data for faster, better‑informed decisions.
If you are more broadly interested in how generative AI can enhance current market research practices, I recommend the full HBR article, where the authors provide a simple grid to identify use cases.
1. Synthetic data in practice: digital personas and digital twins
As Korst, Puntoni, and Toubia discuss, synthetic data in market research can be defined as AI‑generated data designed to replicate the human preferences, responses, or behaviour that would normally be gathered through traditional survey or interview research.
Before you start creating synthetic data with the help of an LLM, there is a critical distinction that will shape how you generate it and what kind of output you can reasonably expect.
Digital personas (top‑down synthetic respondents)
The first approach is to create a digital persona (or synthetic persona). This is a fictional profile generated by an LLM, similar in spirit to the personas marketers have been creating for years. To build such a persona, you provide the model with demographic, psychographic, and behavioural information about a target segment. The resulting digital persona can then answer specific market research questions “on behalf of” that segment, which not only makes the process of creating personas easier, but also allows marketers to interact with them in real time and explore questions more quickly and in greater depth. The quality of the output depends heavily on the quality of the input data and the prompts, but the basic approach is accessible with any commercial LLM.
It is helpful to think of this persona as a probabilistic simulator of how a typical member of the segment might think and respond, not as a statistically representative sample of the segment itself.
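To make this concrete, here is a minimal sketch of the kind of briefing you might give the model. The segment details are invented for illustration, and any commercial LLM chat interface would accept the same text pasted as a first message; treat it as a starting point rather than a validated template.

```python
# Hypothetical persona briefing; the segment details below are illustrative only.
PERSONA_BRIEF = """
You are a digital persona representing a typical member of this segment.

Demographics: owner of a 5-20 employee service business in Switzerland,
aged 40-55, with no dedicated marketing staff.
Psychographics: risk-averse, values predictable monthly costs, sceptical
of tools that promise full automation.
Behaviour: handles marketing personally in the evenings, relies on a basic
website, an email list, and word of mouth.

Answer market research questions as this person would, in the first person,
and say when you are unsure rather than inventing specifics.
"""
```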
Digital twins (individual‑level customer models)
The second approach is to create a digital twin. Here, the goal is not to represent a segment, but to approximate a specific, real customer as closely as possible. While you can start from publicly available data, the real power comes when a company uses its proprietary, longitudinal data on an individual customer’s behaviour to train the model. In that case, the aim is not to build a fictional aggregated profile, but to simulate how a known customer is likely to respond in different situations.
Because this requires sensitive, proprietary data and a high degree of control over the modelling process, it typically goes beyond what most companies will do with a standard commercial LLM interface. For organisations with the resources and technical capabilities, however, such digital twins open up new ways of testing markets by simulating how specific customers—or even whole ecosystems of customers—might respond to different scenarios, such as a price increase or a change in service conditions.
For SMBs, these techniques materially change the cost–insight trade‑off in market research. They drastically lower the financial and time barriers to exploration, allowing teams to systematically test bold ideas and base decisions on directional data in situations where a traditional study would be too expensive—rather than relying purely on gut instinct.
This naturally raises a key question: what is the trade‑off between speed and quality? Is AI‑generated data good enough for real marketing decisions, especially given our collective experience with AI “hallucinations”? The next section looks at what recent research actually says about the accuracy and limitations of synthetic respondents.
2. Why you don’t need massive databases
In a subsequent Harvard Business Review article, Korst, Puntoni, and Toubia discuss a Columbia Business School study that compared the performance of digital twins and digital personas. In that study, digital twins did capture more variance because they were trained on richer, individual-level data, but when it came to predicting the exact answers real humans would give, complex digital twins and simpler synthetic personas both reached roughly 75% accuracy in their experiments.
This result is consistent with a growing body of work on “silicon samples” and synthetic respondents: studies show that, when carefully conditioned, large language models can approximate the distribution of human survey responses across different demographic subgroups with reasonable fidelity. Early evidence from both academia and industry suggests that synthetic samples can reproduce major attitudinal patterns and trends, even though they are not perfect replicas of real populations.
This suggests that the rich proprietary data required to build complex digital twins does not always translate into more accurate prediction of a single, exact answer than a well‑designed generic persona. For SMBs and startups, that is encouraging: it implies that “good enough” synthetic data can be available at relatively low cost, without the need for massive historical databases.
At first glance, a 75% accuracy rate for predicting exact human answers might sound like a limitation. In practice, however, it is more than sufficient for many operational marketing decisions, where the goal is directional guidance rather than perfect foresight. Professor Koen Pauwels of Northeastern University illustrates this with his experience working with Amazon Ads, where synthesised data was used to extract insights about consumer preferences and optimal product recommendations. In fast‑moving environments like Amazon, he notes that the guiding rule for decision‑making was that “70% is plenty”. Businesses do not need 100% perfect predictive information to act; they need reliable directional advice to secure buy‑in and run low‑risk, real‑world experiments.
Traditional human‑based research rarely delivers perfectly stable or fully accurate answers either, as shown by decades of work on response error and the gap between self‑reported and actual behaviour in survey research.
It is crucial, however, to test and validate synthetic outputs rather than treating them as ground truth. Marketers should never blindly trust AI‑generated data; they should compare synthetic results to real‑world benchmarks—whether through small‑scale surveys, experiments, or historical performance—and calibrate their expectations of accuracy accordingly. Business leaders should still rely on human judgment for final decisions, using synthetic data as a fast, directional input rather than a substitute for empirical evidence.
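As a simple illustration of that calibration step, here is a small sketch that compares synthetic answers with a real benchmark for the same statements. All numbers are invented for the example; in practice you would plug in your own small survey or historical results.

```python
import statistics

# Invented example data: mean agreement scores (1-7) for five test statements,
# once from a synthetic persona run and once from a small real-respondent survey.
synthetic = {"price_fair": 5.1, "easy_to_use": 6.0, "would_recommend": 4.2,
             "trust_brand": 3.8, "switch_intent": 2.9}
real = {"price_fair": 4.4, "easy_to_use": 5.8, "would_recommend": 4.5,
        "trust_brand": 3.1, "switch_intent": 2.6}

# Mean absolute error: how far off the synthetic scores are on average.
mae = statistics.mean(abs(synthetic[k] - real[k]) for k in real)

def rank(scores: dict) -> list:
    """Order the statement keys from highest to lowest mean score."""
    return sorted(scores, key=scores.get, reverse=True)

# Rank agreement: does the synthetic data order the statements the same way?
same_order = rank(synthetic) == rank(real)

print(f"Mean absolute error: {mae:.2f} points on a 7-point scale")
print(f"Same ranking of statements: {same_order}")
```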
From a methodological standpoint, synthetic data should be understood as a complementary source of insight rather than a substitute for empirical observation. It excels at generating fast, low-cost directional hypotheses, particularly in contexts where traditional data collection is infeasible. However, because these systems rely on learned statistical patterns rather than lived experience, they risk producing internally coherent but externally invalid outputs. The practical implication is clear: synthetic insight is most valuable when used upstream in the decision process—to explore, prioritise, and frame hypotheses—while real-world data remains essential for validation and calibration.
3. Two ways to use synthetic respondents
In their HBR piece, Korst, Puntoni, and Toubia distinguish between two broad ways of using AI‑generated respondents: a top‑down and a bottom‑up approach. The distinction matters because it determines what kind of questions you can answer and how close the method feels to traditional market research.
The top‑down approach: one “super‑agent” per segment
The top‑down approach is the most readily available method for SMBs and startups. Instead of trying to build a full artificial panel, you ask an AI model to act as a “super‑agent”: a single, composite digital persona that represents a clearly defined segment.
- You describe the segment in detail (for example, “finance directors in mid‑size manufacturing firms in Germany” or “first‑year university students living in shared apartments”).
- You instruct the model to take on that role and answer questions as a typical member of that group.
- You then ask for a best estimate or a small range for the outcome you care about: average willingness to pay, likelihood to try a new offer, perceived risk, and so on.
This approach is essentially an expert estimate in persona form: you are not modelling the full distribution of possible answers, but asking the model to generate a plausible central tendency for the segment.
The advantage is speed and simplicity—you can do this with any commercial LLM and a well‑crafted prompt, without additional software or infrastructure.
Because the top‑down method collapses the segment into a single digital persona, it is well‑suited for early‑stage questions such as:
- “Is this positioning believable and appealing for our target buyers?”
- “What are the main barriers this segment would see to adoption?”
- “Roughly how price‑sensitive would they be in this category?”
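If you want to see what this looks like outside a chat window, the sketch below sends one such question to a super‑agent persona through the OpenAI Python SDK. The model name, segment, and prompt wording are assumptions you would replace with your own, and other LLM providers offer equivalent calls.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

SEGMENT_BRIEF = (
    "You are a composite 'super-agent' persona: a typical finance director "
    "in a mid-size manufacturing firm in Germany (illustrative segment). "
    "Answer as this person would. When asked for numbers, give a best "
    "estimate plus a small range, with a one-sentence rationale."
)

question = (
    "We are considering a cloud-based cash-flow forecasting tool at "
    "EUR 250 per month. How likely would you be to trial it (0-10), "
    "and what would make you hesitate?"
)

response = client.chat.completions.create(
    model="gpt-4o",   # assumed model name; use whichever model you have access to
    temperature=0.7,  # some variability, but not fully creative
    messages=[
        {"role": "system", "content": SEGMENT_BRIEF},
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)
```

Running the same question a handful of times and comparing the answers gives you a quick sense of how stable the persona’s estimate actually is.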
The bottom‑up approach: building a “silicon sample”
The bottom‑up approach is closer to traditional market research. Rather than relying on one composite persona, you ask the model to generate a whole population of synthetic respondents that match your target segment criteria.
- You start by defining the segment as before, but then instruct the model to create a large number of individual personas (for example, 200 or 1,000), each with slightly different traits and opinions.
- This population of personas is sometimes referred to as a “silicon sample”.
- You then “field” your survey with this synthetic sample: you ask your questionnaire, collect their answers, and aggregate the results just as you would with a human panel.
The goal here is to approximate the distribution of answers within a segment, not just a single best guess. You can look at response variability, identify sub‑clusters with different needs, and run more advanced analyses such as simulated segmentations or conjoint exercises.
However, this second method comes with real overhead:
- You need to script persona generation, survey administration, and data aggregation.
- You need safeguards to ensure enough diversity and to avoid the model collapsing back to a single stereotypical respondent.
- You usually need some kind of tooling or code, not just a chat interface.
For most SMBs, this makes the bottom‑up approach more of a second step than a starting point. It is very powerful once you have some experience and basic infrastructure in place, but not necessary to get value from synthetic personas on day one.
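To make that overhead concrete, here is a deliberately simplified sketch of a bottom‑up run, again assuming the OpenAI Python SDK. The trait pools, sample size, and answer parsing are illustrative assumptions; a production version would need stronger diversity safeguards, error handling, and cost controls.

```python
import random
import re
import statistics

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

# Illustrative trait pools used to force variety between synthetic respondents.
AGES = ["late 20s", "mid 30s", "mid 40s", "late 50s"]
ATTITUDES = ["price-sensitive sceptic", "cautious early adopter", "brand-driven buyer"]
CONTEXTS = ["growing firm", "stagnating firm", "recently downsized firm"]

QUESTION = (
    "On a scale from 1 (not at all likely) to 7 (extremely likely), how likely "
    "would you be to switch to a new invoicing tool next quarter? "
    "Reply with the number first, then one sentence of reasoning."
)

def ask_one_persona(rng: random.Random):
    """Generate one randomised persona, field the question, and parse the 1-7 answer."""
    persona = (
        f"You are a synthetic survey respondent: owner of a 5-20 employee service "
        f"business, {rng.choice(AGES)}, a {rng.choice(ATTITUDES)}, running a "
        f"{rng.choice(CONTEXTS)}. Stay in character and answer concisely."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; a cheaper model keeps costs down
        temperature=1.0,      # higher temperature to avoid collapsing into one stereotype
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": QUESTION},
        ],
    )
    match = re.search(r"[1-7]", response.choices[0].message.content)
    return int(match.group()) if match else None

rng = random.Random(42)
answers = [a for a in (ask_one_persona(rng) for _ in range(50)) if a is not None]
print(f"n={len(answers)}, mean={statistics.mean(answers):.2f}, "
      f"distribution={sorted(answers)}")
```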
Why this article focuses on the top‑down persona
Given the constraints most SMBs and startups face—limited budgets, limited time, and often no in‑house research team—the top‑down digital persona is the more practical entry point. It lets you start using synthetic respondents as a thinking partner for market decisions without building your own synthetic panel platform.
The rest of this article therefore focuses on that top‑down use case: how to define the segment, build a strong super‑agent persona, ask the right questions, and interpret the answers in a way that is useful, but methodologically honest.
4. How to build a useful digital persona
Once you accept that synthetic data is a complement to, not a replacement for, human research, the next question is practical: how do you actually build a digital persona you can trust as a thinking partner?
Start with the segment and the decision
Before opening any AI tool, be explicit about two things:
- The segment you want to simulate. Not just “SMB owners”, but “owners of 5–20 employee service businesses in Switzerland who handle marketing themselves”, or “IT managers in 50–200 employee industrial firms in Germany”.
- The decision you want to inform. For example: “Should we launch this feature?”, “How should we price the entry‑level offer?”, or “Which value proposition should we test first?”.
That pairing of clear segment and clear decision prevents you from generating a persona that is interesting to read but irrelevant for action.
What a good digital persona needs in practice
Once the segment and decision are clear, the goal is to brief the AI well enough that it stops behaving like an “average internet user” and starts acting like a plausible member of your segment. In practice, a useful digital persona needs three things:
- Context and constraints: What role does this person play, what are they responsible for, and what limits their choices (budget, time, regulations, internal politics)?
- Goals and success criteria: What does “a good outcome” look like for them in this decision? Saving time? Reducing risk? Hitting a target KPI?
- Typical attitudes and heuristics: How do they usually approach solutions like yours: cautious early adopter, price‑sensitive skeptic, brand‑driven buyer, etc.?
If you give the model this kind of information, you can simply ask it to “act as a typical member of the segment in this situation” and it will produce answers that are at least anchored in a realistic frame, instead of generic consumer talk.
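One way to keep that briefing disciplined is to store the three ingredients in a small, reusable structure and render them into the prompt. The sketch below shows one possible way to do that; the field names and example content are my own illustrative assumptions, not a standard schema.

```python
# Illustrative structure for a persona brief; field names are assumptions, not a standard.
persona_brief = {
    "segment": "IT managers in 50-200 employee industrial firms in Germany",
    "context_and_constraints": (
        "Responsible for uptime and security; budget approved annually; "
        "any new tool must pass a works-council and data-protection review."
    ),
    "goals_and_success_criteria": (
        "Reduce unplanned downtime and avoid being blamed for outages; "
        "a good outcome is fewer escalations, not necessarily the cheapest tool."
    ),
    "attitudes_and_heuristics": (
        "Cautious late majority; prefers references from similar firms; "
        "distrusts vendors who cannot explain where data is hosted."
    ),
}

def render_system_prompt(brief: dict) -> str:
    """Turn the structured brief into a system prompt for the LLM."""
    return (
        f"Act as a typical member of this segment: {brief['segment']}.\n"
        f"Context and constraints: {brief['context_and_constraints']}\n"
        f"Goals and success criteria: {brief['goals_and_success_criteria']}\n"
        f"Typical attitudes and heuristics: {brief['attitudes_and_heuristics']}\n"
        "Answer in the first person and flag anything you are genuinely unsure about."
    )

print(render_system_prompt(persona_brief))
```

The benefit of keeping the brief as data rather than free text is that you can update one field as you learn more about the segment and regenerate the prompt consistently.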
How to use the persona without overtrusting it
With a digital persona in place, the temptation is to treat it like a magic focus group. A more disciplined approach is to use it as a structured thinking partner:
- Ask it focused questions tied to your decision (“How would you react to this price point?”, “What would make you hesitate?”, “Which of these three messages is most credible, and why?”), instead of asking about everything at once.
- Look for patterns and explanations, not exact percentages. You want to understand why this segment might react in a certain way, not to replace a full market sizing study.
- Do a quick sanity check: if the persona’s answers completely contradict what you already know from real customers, treat that as a signal to dig deeper, not as a revelation.
5. Putting your digital persona to work
Once you have a reasonably well‑defined digital persona, the risk is to let it sit in a slide deck. The real value comes from using it in your everyday decisions—and treating it as something you can refine over time.
A practical starting point is to plug the persona into decisions where you currently rely on educated guesses: early‑stage product ideas, pricing discussions, or messaging debates.
Before a meeting, you can run a quick synthetic “pre‑mortem”: ask the persona how it would react to a proposed offer, what would make it hesitate, or which of two value propositions feels more credible and why. You are not looking for definitive answers, but for sharper questions and better‑informed hypotheses going into the room.
You can also create digital personas for segments you rarely have budget or access to research properly—for example, specialised B2B decision‑makers or regulated patient groups—and use them to get directional insight on how those audiences might perceive your brand, rank competing benefits, or respond if a new entrant changed the rules of the category. In those cases, synthetic insight is not replacing a survey; it is replacing no data at all.
Finally, treat each persona as a living object. As you gather real‑world feedback—campaign performance, sales conversations, customer interviews—adjust the persona’s constraints, goals, and attitudes. If it consistently overestimates enthusiasm or underestimates price sensitivity, bake that correction into the way you brief and interpret it next time. Over time, you are not just “using AI”; you are building a lightweight, evolving decision aid that reflects both model knowledge and your own experience.
Conclusion: from survey tool to strategic capability
The key message is simple: synthetic data, and especially well‑built digital personas, are not just a cheaper way to run surveys; they are a strategic capability that lets SMBs ask questions they could never afford to ask before, reach audiences they could never practically reach, and test ideas that would otherwise be dismissed as too risky or expensive to investigate. Used as a complement to real‑world data and judgment (not a substitute), they shift market research from a rare, expensive event to an everyday decision aid.
The enduring value lies in the “and, not or” mindset: AI augments human insight, it does not replace it. Map your research process, pick the one step where the cost or friction is currently highest, and run a small synthetic study against a ground‑truth sample. Learn from the gap between persona and reality, adjust, then expand. That is how you move from doing things better to doing better things.
I’m curious: what’s one research question you’ve parked because it felt too costly or complex to investigate? Drop it in the comments so others can learn from it too.
If you like this kind of research-backed marketing insight, subscribe to the newsletter at marclounis.com to get a condensed version every month.

Marc Lounis
Sources
Argyle, L., Busby, E., Fulda, N., Gubler, J., Rytting, C., and Wingate, D. (2023) “Out of One, Many: Using Language Models to Simulate Human Samples.” Political Analysis, 31(3), 337–351.
Empathy Lab (2025) “Synthetic Personas Speed Up Research, but Human Judgment Remains the Compass.” Available at: https://www.empathylab.com/en/news-and-insights/synthetic-personas-speed-up-research-but-human-judgment-remains-the-compass
Korst, M., Puntoni, S., and Toubia, O. (2025) “How Gen AI Is Transforming Market Research.” Harvard Business Review.
Korst, M., Puntoni, S., and Toubia, O. (2025) “How to Use AI-Generated Data in Market Research.” Harvard Business Review.
Mueller, K. et al. (2025) “KI in der Marktforschung: Frameworks und Beispiele aus der Praxis.” ZHAW / Swiss Insights.
Pauwels, K. (2025) “Amazon Ads / synthetic data discussion” (video/interview reference).
Toubia, O. et al. (2025) “Using LLMs for Market Research.” Columbia Business School / SSRN Working Paper.
“Response Errors in Survey Research” (1980) California Management Review.
