IRQ Research Findings

DARE / Resolution

The first response is the audition.

What 9 customers and 10 pros told us about AI-assisted initial responses on Thumbtack.

Customer concept testing (N=9) + Pro concept testing (N=10). Directional findings, core patterns replicated across studies.

~15 min read

What the two studies, together, tell us

  1. The first response carries more leverage than we previously gave it credit for.

    The LLM-assisted Response Quality Investigation (RQI) positioned follow-up as higher leverage than first-response on the assumption that many first responses never reach a decision threshold. Customer concept testing sharpens that: once a customer has 2–3 viable pros in hand, the lower-ranked responses have no recovery path. The first response is effectively the audition. This doesn't invalidate follow-up as a learning surface, but it raises the bar on what v1 of this work has to get right.

  2. "Acknowledgment + one bounded next step" replicates as the success pattern — with two critical refinements.

    (a) Acknowledgment has to be lightweight — 1–3 high-signal details, varied across pros, not a transcript of the request form. Over-mirroring feels mechanical and breaks the trust signal it's meant to create. (b) The bounded next step should be a question, not a declarative statement, so the cognitive cost of replying is low and the ball is in the customer's court.

  3. "No next step" is an unrecoverable failure. Ignoring a stated detail is an unrecoverable failure.

    These are the two hardest failure modes across all 9 customer sessions. Everything else is gradient.

  4. Pros preferred the toggled draft (Concept 2), but the mechanism is trust, not performance.

    The toggles didn't change what pros sent so much as how confident they felt sending it. Since C2 can't ship in v1 on the current timeline, the plain draft must clear a higher bar on accuracy and voice to compensate — pros who don't trust the output will delete it or disable the feature.

  5. Anti-redundancy across pros is a product requirement, not a nice-to-have.

    Both sides surfaced this independently: customers detect AI when multiple pros send near-identical messages; pros refuse to adopt a tool that flattens their differentiation. Currently scoped P1 — the research suggests it's closer to P0 for adoption.

  6. Credibility is behavioral, not boilerplate.

    Generic credibility signals ("highly experienced technicians," "top reviews") are largely invisible by the time the customer is reading the message — they've already seen reviews on the SP. What builds trust in-message is demonstrating competence through the right question.

Two studies, designed to triangulate

Nine customer sessions had participants submit a real project request and interact with seven pros on a prototype, each pro trained on a distinct response archetype (5 failure patterns, 2 success patterns) distilled from the 600-thread RQI investigation. Participants ranked and reasoned about pro responses at the end. Ten pro sessions tested three AI-assisted response concepts: C1 plain AI draft (closest to the v1 PRD scope), C2 structured draft with content/tone/length toggles, and C3 one-tap send from the leads list as a deliberate boundary test.

The two studies were designed to triangulate. Customer sessions stress-tested the behavioral content the tool needs to produce. Pro sessions stress-tested the workflow and control around how it gets produced and sent. The customer study was designed to check whether the response patterns from the RQI hold up when real customers interact with them. The pro study was a divergent-prototype study intended to surface behavioral limits — not to converge on a preferred UI.

Customer study

N=9 customers · 7 pros (P1–P7) · 5 failure + 2 success patterns

Each customer submitted one real project and interacted with seven pros trained on distinct response archetypes.

Pro study

N=10 pros · C1 plain draft · C2 toggles · C3 one-tap send

Each pro reviewed three divergent AI-assist concepts. The set was designed to surface behavioral limits, not to pick a winner.

Customer-side prototype showing 7 pro archetypes in a sidebar and an active chat with Pro 3
Customer-side prototype: participants chatted with seven pros trained on distinct response archetypes. Prototype at arch-one.vercel.app.

The three pro-side concepts were intentionally divergent, not incremental variations — each was included to answer a specific question.

C1 · Plain draft. Closest to the v1 PRD scope. Anchor concept: do pros trust and use an AI-generated draft they can edit?

C2 · Structured draft with toggles. Tests whether giving pros surface-level content/tone/length controls changes how they feel about the same underlying draft.

C3 · One-tap send from leads list. A deliberate boundary test. Pushes past C1 to find where pros draw the line on trading context for speed.

Directional findings. Core patterns replicated across studies. Not a causal claim about conversion — the MBT will test whether the draft moves customer reply rate.

What customers do when seven pros message them

What replicated from the RQI

  • Effort-vs-confidence dynamic holds. When effort rises faster than confidence, customers defer. The prototype sessions let us watch this happen in real time: customers engaged warmly with Pro 3's thorough questions or dropped off sharply from Pro 5's high-effort intake bundle (photos + measurements + availability + budget up front).
  • Acknowledgment + one bounded next step continues to outperform. Pro 7 (mirroring + one question + credibility signal) and Pro 6 (acknowledgment + one clarifying question) were the consistent winners.
  • Early channel switching creates friction — but is not universally negative. Bryson read Pro 4's phone number offer as disintermediation and pushiness; Tanya (managing 23 properties) and Arpit read the same move as efficient and welcome. Timing and customer context moderate the reaction.

New or sharpened from customer concept testing

The effort-vs-confidence tradeoff is bimodal, not a curve.

Detail-oriented customers (Xuan, Melody, Barbara) treated Pro 3's thorough upfront questions as a competence signal — "any pro that's going to ask about all the details first strikes me as trustworthy." Others (NaQuia, Tanya) found the same pattern overwhelming and preferred Pro 7's middle path of lightweight mirroring plus staged questions. The implication is not "personalize to segment" (we can't reliably segment on this) but rather design the default to favor Pro 7's pattern, which was robust across both groups, and let detail-oriented customers self-select through follow-up.

[Chart: confidence vs. upfront detail level (low to high), with detail-oriented and efficiency-focused customers diverging at the extremes; Pro 7 marked at the robust middle.]

Effort-vs-confidence is bimodal. Pro 7's middle-ground pattern was robust across both customer types.

Listening is the strongest trust signal and the easiest to break.

Bryson stated Saturday after 11am as his availability; Pro 1 offered Friday 2pm and Monday 10am. That single miss was unrecoverable in his session — not a trust discount, a disqualification. Across sessions, ignoring a stated detail (time, location, a direct question) was the single most consistent path to being dropped.

This finding has a direct hook into the prompt work. Two specific data inputs currently missing or underweighted would address it: (1) the specific date/time selections the customer made — not just the binary "did they select availability." Chelsia flagged that V6 outputs are still generating "when works for you?" questions in threads where the customer has already provided windows. (2) The multimodal query data — customer search query, photos, and captions — where customers encode signal that doesn't appear in structured fields. Chelsia pointed to a concrete failure pattern already visible in production: pros asking "what exactly did you want me to install" when the answer was in a photo the model can't see. That's the Bryson failure mode, at scale, in live threads.

"Just listen to me. If I'm telling you Saturday after 11, why are you telling me Friday and Monday?"

— Bryson, customer session
Customer wrote

…available Saturday after 11am. Looking for someone who can come same-day if possible.

Pro 1 responded

I can schedule you for Friday 2pm or Monday 10am. Let me know what works!

Bryson: disqualified.
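To make the two input gaps above concrete, here is a minimal sketch of a prompt-context builder that consumes them. Everything here (LeadContext, build_prompt_context, the field names) is a hypothetical illustration, not the actual pipeline:

```python
from dataclasses import dataclass, field

@dataclass
class LeadContext:
    """Hypothetical container for the two missing prompt inputs."""
    free_text: str                                             # customer's free-text message
    time_windows: list[str] = field(default_factory=list)      # e.g. ["Saturday after 11am"]
    search_query: str = ""                                     # query typed before the request form
    photo_captions: list[str] = field(default_factory=list)    # captions carry signal structured fields miss

def build_prompt_context(lead: LeadContext) -> str:
    """Render context so the draft can USE stated windows instead of
    re-asking 'when works for you?' (the Bryson failure mode)."""
    lines = [f"Customer message: {lead.free_text}"]
    if lead.time_windows:
        lines.append("Stated availability: " + "; ".join(lead.time_windows))
        lines.append("Rule: do not ask for availability; propose a time inside these windows.")
    if lead.search_query:
        lines.append(f"Original search query: {lead.search_query}")
    for caption in lead.photo_captions:
        lines.append(f"Photo caption: {caption}")
    return "\n".join(lines)
```

The point of the sketch: once the specific windows are present in context, "asking for availability" can be made a hard rule violation rather than a quality judgment call.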

Competitive shortlisting collapses the recovery window.

Once NaQuia and Arpit had identified 2–3 viable pros, they explicitly said they would not respond to the rest — even if those pros sent perfectly fine follow-ups. This is new information relative to the RQI framing, which positioned follow-up as a recovery mechanism. It is a recovery mechanism, but only for pros who made the shortlist on the first message. For pros who didn't, there's no recovery.

[Diagram: all 7 pros respond → customer shortlists 2–3 → no recovery path for the rest.]

"Once I had 3 viable options… I probably would not respond to the others."

— NaQuia, customer session

Bot detection is already happening.

Jay wrote intentionally vague prompts because he detected LLM generation and wanted to see how each pro handled ambiguity. Customers flagged AI-sounding language through: perfect grammar, instantaneous response times, em-dashes, uniform structure across pros, and — most importantly — verbatim echoing of request-form details. Grace Boatwright's feedback on the v3 prompt output independently flagged the same signals.

If customers can detect AI in a research session, they can detect it in production. Anti-redundancy becomes the difference between "Thumbtack helped me respond" and "Thumbtack is sending form letters on my behalf."

Pricing upfront is a significant trust builder — even when pros resist it.

Roslyn's session: Pro 6's price mention was the moment she moved that pro to top-of-list. Pros in the parallel study generally avoid pricing upfront because services are custom. There's a real tension here that v1 won't resolve, but worth flagging: the customer-side appetite for price clarity is stronger than most pro workflows currently accommodate.

Seven archetypes, three outcomes

Archetype | Pattern | Outcome
Pro 1 — No acknowledgment | Friendly but generic | Generally dropped. Disqualifying when stated details ignored.
Pro 2 — No clear next step | Acknowledges but stalls | Near-universal hard failure. Cognitive load transfers to customer.
Pro 3 — Full intake | High-effort upfront | Split. Trust signal for detail-oriented; overwhelming for others.
Pro 4 — Call-first | Offers phone early | Split. Works for efficiency-focused; pushy for others.
Pro 5 — Question-ignoring | Redirects past asked questions | Disqualifying. Closest analog to Pro 1.
Pro 6 — Acknowledgment + one question | Mirrors lightly, asks one thing | Strong and consistent.
Pro 7 — Mirroring + question + credibility | Best-fit across both groups | Most robust across all 9 sessions.

"There's nothing to respond to… the cognitive load is on me to carry on."

— Xuan, on Pro 2

What pros do when we hand them an AI draft

The C2 preference is about trust, not performance

Six of seven pros preferred C2 (structured draft with toggles) over C1 (plain draft). The obvious read is "pros want more control." The more precise read — which matters for product strategy — is that the toggles functioned as a trust mechanism, not a performance mechanism. Pros didn't describe the toggles changing the content of what they sent in meaningful ways. They described them as making them feel safer sending what was already there.

This matters because C2 can't ship in v1. The implication is not "build C2 anyway" — it's that the plain draft (C1) has to clear a higher bar on accuracy and voice than it would if pros had an in-context safety valve. If pros can't tweak quickly, they need to trust out of the box — which means output quality, voice detection, and verification affordances all get more weight.

An accuracy-verification affordance may substitute for control at lower cost. Grace's observed behavior (checking the lead details before sending) and Sarthak's proposal — a button that surfaces the original request details right in the compose step — map directly to the same underlying anxiety: "is this draft accurate to what this customer said?"

C1 — Plain draft: the AI-drafted reply to Tracey, labeled "Written by AI, tailored for you." The pro's unspoken questions: "Is this accurate to what she said?" "Does it sound like me?"

C2 — Structured draft with Content, Tone, and Length toggles: the toggles register as a control signal, a safety valve.

Same content. Different feeling. Different mechanism.

"It's a worthless tool if I can't have some input as to what it's generating."

— Michael, pro session

C3 was a unanimous rejection — with two overlapping reasons

Every pro said they'd always view lead details before sending. But the reasons varied, and both matter:

  1. Category-driven rejection. Don (caricature artist), Monica (catering), and Emmanuel (commercial flooring) rejected C3 because their businesses sell specific dates or spaces — they can't commit without verifying availability.
  2. Universal rejection. Even pros without that constraint (Mark, Dan, Grace) said they'd always check the lead first.

This validates keeping the compose flow inside the lead details view rather than at the leads-list level — already the PRD direction. Worth flagging: C3 surfaced a secondary insight — pros value the view-details step as a decision-quality moment, not just an information-gathering one.

Reply from leads list (C3): a direct Reply button that bypasses lead details. Too fast. No context.

Reply from lead details: the full view with customer message, pricing, and request form answers. Context first.

The compose moment belongs here, not there.

"It's a trap to reply too fast."

— Grace, pro session

Voice, differentiation, and "AI-ed" language

Mark immediately read the v1 draft as "AI-ed." Don flagged em-dashes as an AI tell. Dan (Minnesota Headshots): "No one would ever write this stuff… that immediately takes confidence away." The concern cuts two ways:

  • Pro-side: if pros perceive the draft as AI-ish, they'll edit heavily (cost: adoption friction) or disable the feature entirely (cost: reach).
  • Customer-side: pros also worry that if every pro sends similar AI-drafted responses, customers will notice and trust the platform less — not just any individual pro. Dan flagged this as a platform-level risk.

This maps directly onto Grace Boatwright's v3 prompt feedback (excessive mirroring, overly formal language, uniformity across pros). The two data streams are pointing at the same thing from opposite ends — customers detect uniformity, pros detect uniformity, and both interpret it as a quality/authenticity failure.

"If all my competitors are clicking the same button, there is no differentiation."

— Mark, pro session

Next-step defaults and the tension with RQI priors

Pros have strong opinions about what the next step should be, and they almost all default to phone call. The RQI investigation positioned call-first escalation as a failure pattern when it outpaces confidence. The pro sessions show that call-first isn't just a habit — it's a rational risk-management move in pro workflows (qualify lead, reduce wasted spend, move off-platform to control pacing). The customer sessions show that call-first works for some customers and breaks for others, with timing as the moderator.

v1 descopes pro preferences for next-step type. The research suggests this is the right call for the initial MBT — we need to learn what works for customers before opening up pro customization — but we should plan to surface next-step control relatively quickly as a fast-follow, and communicate to pros that the tool is defaulting to what the data says works, not overriding their judgment.

What changes for the plan

The moves that follow from both studies together. These are the changes to v1 scope and priority that the research supports.

1. The plain draft needs to nail four things to be defensible in v1

Because we're shipping without toggle-based control, the default output has to earn trust without a safety valve.

  1. Lightweight mirroring

     "Saw you're looking for a 60-gallon electric water heater replacement — that's a project I can likely turn around same-week."

     1–3 high-signal details from free-text, not a form transcript. Varied across pros.

  2. Next step as a question

     "Does Thursday afternoon work for a 15-min call to confirm the unit size?"

     Low cognitive cost to reply. Ball in customer's court. Not: "I can schedule Thursday."

  3. No boilerplate credibility

     Not this: "As a highly experienced plumber with top reviews…"

     They've already seen your reviews. Credibility comes from asking the right question.

  4. Voice variation across pros

     Vary: length, which details get mirrored, sentence structure. Different across pros responding to the same lead.

     Cheapest anti-redundancy move. Doesn't require rewriting voice per pro.

Reflected in the prompt and data inputs.

The PRD's data inputs table now excludes pro bio, years in business, licenses, service categories, past jobs, targeted categories, and quote sheet — with the explicit rationale that these were generating sales-y content and that customers have already seen this context on the pro's profile page. Credibility is being engineered as behavioral (through the question asked), not through self-promotional language in the message itself. V7 of the prompt operationalizes lightweight mirroring (1–3 details) and one bounded next step as a question.
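As a concrete illustration of what "operationalizes" means here, the writing rules might reduce to constraints of roughly this shape. This is a hypothetical sketch, not the actual V7 prompt text:

```python
# Hypothetical sketch of the V7-style writing rules described above.
# The numbering and wording are illustrative, not the shipped prompt.
WRITING_RULES = """\
1. Mirror 1-3 high-signal details from the customer's free-text message.
   Never restate the request form verbatim.
2. End with exactly one next-step question the customer can answer in a
   sentence (a time confirmation, a detail check). No declarative
   scheduling claims like "I can schedule Thursday."
3. No self-promotional credibility language ("highly experienced",
   "top reviews"). Credibility comes from asking the right question.
4. Vary length, sentence structure, and which details are mirrored, so
   two pros answering the same lead don't produce near-identical drafts.
"""
```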

2. Anti-redundancy should be P0, not P1

  • Customers compare messages in parallel and will detect uniform output.
  • Pros won't adopt a tool that flattens their differentiation.
  • Cheapest way in is to vary which details get mirrored and the way each response is structured (see the sketch after this list) — not to rewrite each pro's voice from scratch, which is costly.
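A minimal sketch of that cheap version, assuming candidate details have already been extracted from the lead. The function and the seeding scheme are illustrative assumptions, not the shipped design:

```python
import hashlib
import random

def pick_mirrored_details(details: list[str], pro_id: str, lead_id: str) -> list[str]:
    """Deterministically vary which 1-3 details each pro's draft mirrors,
    so pros responding to the same lead don't echo identical content.
    Seeding on (pro_id, lead_id) keeps each pro's selection stable across
    draft regenerations while still differing across pros on one lead."""
    seed = int(hashlib.sha256(f"{pro_id}:{lead_id}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    k = min(len(details), rng.randint(1, 3))
    return rng.sample(details, k)

# Two pros, same lead -> different (but stable) subsets to mirror.
details = ["60-gal electric unit", "Saturday after 11am", "same-day preferred"]
print(pick_mirrored_details(details, pro_id="pro_17", lead_id="lead_42"))
print(pick_mirrored_details(details, pro_id="pro_88", lead_id="lead_42"))
```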

3. Verification affordance as a lower-cost substitute for toggles

Sarthak's proposal — a button on the compose step that surfaces the original request details — maps directly to Grace's observed verification behavior (checking the lead details before sending). The underlying anxiety is the same: "is this draft accurate to what this customer said?"

Sarthak raised this exact proposal in-session during 4/15 pro research. It's on the team's radar — recommending we scope for v1 if timeline permits, v1.1 fast-follow otherwise. This is a lower-cost trust mechanism than building the full C2 toggle UI, and it addresses the same root cause.

4. The pro-side communications layer matters more than it looks

This connects directly to Cailee's GTM work. Key moves:

  • Chelsia's original framing of the NUX goal (influence pro behavior and understanding, not blind adoption) aligns with what pros told us.
  • Pros who don't understand why the tool is making a particular choice will override toward their own default (usually a call).
  • The 4/21 legal review removed the heavy AI-disclosure requirement, which loosens the legal case for a NUX but strengthens the behavioral one.

Where the research lands on the NUX decision

A P1 lightweight education NUX has been scoped (not launch blocking). What the research speaks to is less whether to have one and more what it should carry and where it should live. Session evidence pointed consistently to awareness before first exposure — Grace, Michael, Monica, and Mark all expressed wanting advance notice about what the tool does, not a pop-up at the compose moment. The distinction matters: preparedness can be delivered through Cailee's GTM emails/pushes; a compose-level surface risks crowding the trust-sensitive moment when the pro is already forming an opinion about the draft itself.

Research lean: consolidate awareness into the GTM layer; if an in-product surface is valuable on top of that, scope it upfunnel (jobs tab, lead detail) rather than at compose.

  • Email / Push: awareness territory
  • Jobs tab / Lead detail: awareness territory
  • Compose moment: trust-sensitive — don't crowd

Top three behavioral rationales to carry through the GTM and NUX layers

  1. Listening matters most — the tool is built around not asking for things the customer already gave you.
  2. One bounded next step as a question — customers reply when it's easy; they stall when the response dead-ends.
  3. Differentiation is intentional — the tool is designed to make each pro's response distinct, not interchangeable.

Cailee's campaign brief submitted 4/23 incorporates this framing.

5. Separate signal restoration from intervention design — visibility still matters

The RQI investigation's hard constraint — ~25–30% of pro responses are never viewed — still holds and is not addressable via response quality. This work should not be measured against outcomes in threads where the customer never saw the response. Primary success metric (% projects with customer reply within 24hr of first pro response) correctly gates on the response existing and being seen; worth confirming the measurement excludes unseen messages cleanly.
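For concreteness, a minimal sketch of the gated metric, assuming per-thread event timestamps are available. The column names are hypothetical stand-ins for whatever the real event tables expose:

```python
import pandas as pd

def reply_rate_within_24h(threads: pd.DataFrame) -> float:
    """Share of projects with a customer reply within 24h of the first pro
    response, counting only threads where the response was actually seen.
    Threads the customer never viewed are excluded from the denominator
    entirely, per the visibility gate described above."""
    seen = threads[threads["first_response_viewed_at"].notna()]
    replied = (
        seen["first_customer_reply_at"].notna()
        & (seen["first_customer_reply_at"] - seen["first_pro_response_at"]
           <= pd.Timedelta(hours=24))
    )
    return float(replied.mean())  # fraction in [0, 1]
```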

Two prompt input gaps the research pointed to — and where we are on each

Research surfaced two specific prompt input gaps that map to the Bryson-style listening failure: specific date/time selections the customer made (not just "did they provide availability"), and multimodal query data (search query, photo presence, captions). Status as of the V7 prompt build:

  ✅ Multimodal data: photos (binary) and captions are now in prompt context.
  🟡 Specific date/time selections: in progress. Scheduling data is being separated from urgency into its own prompt column; urgency values are moving from binary to granular.

Closing the remaining gap is what separates a draft that asks "when works for you?" in a thread where the customer has already given their windows, from a draft that uses those windows in the next-step question.

Both are on the V7 track ahead of April 29 prompt finalization.

What this doesn't settle

Open tensions

These are tensions the research surfaced but doesn't resolve. They need product/strategy judgment, not more research (for now).

Trust mechanism vs. performance mechanism

Do we invest in toggles post-v1 primarily to drive adoption (trust) or because they measurably improve outputs (performance)? The research suggests the former. If that's right, the investment case is different — it's an adoption lever, not a content lever.

Call-first default vs. data-driven default

If the data shows that asking a follow-up question gets more replies than suggesting a call, should the draft default to the follow-up question even when pros would have chosen a call? Becomes a communication and education question, not just a product one.

Differentiation vs. quality floor

If we vary outputs enough to avoid uniformity, some variations will be worse than others. The customer research suggests variance has real value — customers used the differences to evaluate fit — but it's a trade we haven't explicitly made.

Pricing transparency vs. pro resistance

Customer-side pull is real; pro-side resistance is real. v1 doesn't need to resolve it; post-MBT does.

What's out of scope for this synthesis

To be explicit about what this doc isn't claiming.

  • Not a causal claim about conversion. Both studies were designed to check whether response patterns hold up, not to run an A/B test. The MBT will test whether the draft actually moves customer reply rate.
  • Not a scale-readiness assessment. v1 is unlikely to be scale-ready in its current form and a v2 iteration is likely. This doc informs what v1 needs to clear to be a useful learning test.
  • Not an endorsement of Pro 7's specific language. The win was about pattern (lightweight mirroring + question + signal), not wording. The prompt work should aim at the pattern, not the text.
  • Not a complete picture of the visibility layer. The RQI investigation established unseen messages as a structural gate; this work builds on that but doesn't address it.

What happens next

Recommended next steps

Immediate — before April 29 prompt finalization

  • Revise prompt to weight customer free-text message above request form fields. V7 prompt.
  • 🟡 Add specific date/time selections to the prompt context. Scheduling separated into its own column per 4/23 AS standup; urgency moving from binary to granular.
  • Add multimodal query data (customer search query, photo presence + captions). Photos (binary) and captions now in prompt context.
  • Implement lightweight mirroring constraint: 1–3 details max, varied across pros. V7 prompt — explicit in writing rules.
  • Reframe bounded next step as a question in the prompt. V7 prompt — "Ask at least 1 strong next-step question, ideally end with it."
  • Deprioritize generic credibility language. PRD data inputs now exclude pro bio, years in business, licenses, service categories, past jobs, targeted categories, and quote sheet.
  • 🟡 Scope the compose-step "show lead details" verification button for v1 or earliest fast-follow. Proposed by Sarthak 4/15 — in team discussion.

Eval sample composition

  • Ensure the eval set includes a mix of booking archetypes and job natures, a subset of pros with no past conversation history, and multiple pros per request_pk. Chelsia's 4/23 test cases spec covers this — categories handpicked across job natures and booking archetypes, pros without past history explicitly scoped.
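A rough sketch of what that composition could look like, assuming a flat leads table. All column names are hypothetical; Chelsia's 4/23 spec remains the source of truth:

```python
import pandas as pd

def build_eval_sample(leads: pd.DataFrame, per_cell: int = 5, seed: int = 7) -> pd.DataFrame:
    """Hypothetical stratified sampler for the eval set. Column names
    (booking_archetype, job_nature, pro_message_history_count, request_pk)
    are illustrative stand-ins."""
    parts = []
    # 1) Spread across booking archetype x job nature cells.
    for _, cell in leads.groupby(["booking_archetype", "job_nature"]):
        parts.append(cell.sample(min(len(cell), per_cell), random_state=seed))
    # 2) Force in pros with no past conversation history.
    cold = leads[leads["pro_message_history_count"] == 0]
    parts.append(cold.sample(min(len(cold), per_cell), random_state=seed))
    sample = pd.concat(parts).drop_duplicates()
    # 3) Pull in every pro attached to an already-sampled request, so
    #    redundancy across pros answering the same lead is visible.
    same_request = leads[leads["request_pk"].isin(sample["request_pk"])]
    return pd.concat([sample, same_request]).drop_duplicates()
```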

Pre-MBT — pro-facing communications

  • Cailee's GTM content to lead with the top three behavioral rationales — listening, bounded next step as question, differentiation. Campaign brief submitted 4/23.
  • Consolidate awareness into the GTM layer; if an in-product surface is valuable, scope it upfunnel. P1 NUX scoped, not at compose — aligned with research lean.
  • Design pro opt-out path. Per PRD: inbound requests only, case-by-case, removed from experiment.

Post-MBT

  • Next-step control as v1.1 fast-follow. Still pending.
  • Voice training on pro message history — plan the roadmap explicitly. Still pending, currently P1 per PRD.
  • Pricing transparency as a separate workstream. Still pending; flagged for post-MBT.

Measurement

  • Primary metric should explicitly gate on response visibility, not just response sent. Already in PRD: "% of projects where a customer sends at least one message within 24 hours of the pro's initial reply."
  • Add guardrail on pro "disable feature" rate as an adoption signal. Still pending.

Research's own next steps

  • 🟡 Refresh the IRR analysis on the updated tags, values, and decision rules. In progress — next up after findings sync.

Appendix

Archetype performance across customer sessions (directional)

Archetype | Pattern | Net reception
Pro 1 — No acknowledgment | Friendly but generic; no mirroring of stated details | Generally dropped by most participants. Becomes disqualifying when stated details (time, location) are ignored. Lower harm ceiling than Pro 2 or Pro 5 because at least it doesn't actively contradict.
Pro 2 — No clear next step | Acknowledges request but ends without a clear action | Near-universal hard failure. Cognitive load transfers to the customer to figure out the next move. Xuan's "there's nothing to respond to" quote captures this exactly.
Pro 3 — Full intake | High-effort upfront: multiple detailed questions in first message | Bimodal. Detail-oriented participants (Xuan, Melody, Barbara) found it thorough and trustworthy. Efficiency-focused participants (NaQuia, Tanya) found it overwhelming. Not universally safe.
Pro 4 — Call-first | Offers phone number or call invite in first message | Bimodal. Efficiency-focused customers read it as professional and time-saving. Others read it as pushy or disintermediating. Timing and customer context are the key moderators.
Pro 5 — Question-ignoring | Redirects away from a question the customer directly asked | Disqualifying. Same family of failure as Pro 1, but more active — it doesn't just skip acknowledgment, it ignores what the customer directly asked. Hardest failure outside of Pro 2.
Pro 6 — Acknowledgment + one question | Mirrors 1–2 specific details, asks one bounded clarifying question | Strong and consistent. Worked across most participant types. No disqualifying moments. Low effort to respond. Robust second-best.
Pro 7 — Mirroring + question + credibility signal | Lightweight mirroring, one question, implicit credibility via question quality | Most robust across all 9 sessions. The only archetype that consistently worked across both detail-oriented and efficiency-focused customers. Credibility signal landed through pattern, not through self-promotional language.

Verbatims referenced throughout


"Once I had 3 viable options… I probably would not respond to the others."

NaQuia — customer session

"Just listen to me. If I'm telling you Saturday after 11, why are you telling me Friday and Monday?"

Bryson — customer session

"Any pro that's going to ask about all the details first strikes me as trustworthy."

Xuan — customer session

"There's nothing to respond to… the cognitive load is on me to carry on."

Xuan — customer session, on Pro 2

"It's a worthless tool if I can't have some input as to what it's generating."

Michael — pro session

"You're giving me the control."

Turk — pro session

"It's a trap to reply too fast."

Grace — pro session

"If all my competitors are clicking the same button, there is no differentiation."

Mark — pro session

"No one would ever write this stuff… that immediately takes confidence away."

Dan — pro session

"I like to be prepared for changes that are coming."

Grace — pro session, on advance notice

"If the data showed that, that would change how I use the tool."

Jen — pro session, on call-first default vs. data-driven default