IRQ Research Findings

DARE / Resolution

The first response
is the audition.

What 9 customers and 10 pros told us about AI-assisted initial responses on Thumbtack.

Customer concept testing (N=9) + Pro concept testing (N=10). Directional findings, core patterns replicated across studies.

~15 min read

What the two studies, together, tell us

  1. The first response carries more leverage than we previously gave it credit for.

    The RQI investigation positioned follow-up as higher leverage than first-response on the assumption that many first responses never reach a decision threshold. Customer concept testing sharpens that: once a customer has 2–3 viable pros in hand, the lower-ranked responses have no recovery path. The first response is effectively the audition. This doesn't invalidate follow-up as a learning surface, but it raises the bar on what v1 of IRQ has to get right.

  2. "Acknowledgment + one bounded next step" replicates as the success pattern — with two critical refinements.

    (a) Acknowledgment has to be lightweight — 1–3 high-signal details, varied across pros, not a transcript of the request form. Over-mirroring feels mechanical and breaks the trust signal it's meant to create. (b) The bounded next step should be a question, not a declarative statement, so the cognitive cost of replying is low and the ball is in the customer's court.

  3. "No next step" is an unrecoverable failure. Ignoring a stated detail is an unrecoverable failure.

    These are the two hardest failure modes across all 9 customer sessions. Everything else is gradient.

  4. Pros preferred the toggled draft (Concept 2), but the mechanism is trust, not performance.

    The toggles didn't change what pros sent so much as how confident they felt sending it. Since C2 can't ship in v1 on the current timeline, the plain draft must clear a higher bar on accuracy and voice to compensate — pros who don't trust the output will delete it or disable the feature.

  5. Anti-redundancy across pros is a product requirement, not a nice-to-have.

    Both sides surfaced this independently: customers detect AI when multiple pros send near-identical messages; pros refuse to adopt a tool that flattens their differentiation. Currently scoped P1 — the research suggests it's closer to P0 for adoption.

  6. Credibility is behavioral, not boilerplate.

    Generic credibility signals ("highly experienced technicians," "top reviews") are largely invisible by the time the customer is reading the message — they've already seen reviews on the SP. What builds trust in-message is demonstrating competence through the right question.

Two studies, designed to triangulate

Nine customer sessions had participants submit a real project request and interact with seven pros on a prototype, each pro trained on a distinct response archetype (5 failure patterns, 2 success patterns) distilled from the 600-thread RQI investigation. Participants ranked and reasoned about pro responses at the end. Ten pro sessions tested three AI-assisted response concepts: C1 plain AI draft (closest to the v1 PRD scope), C2 structured draft with content/tone/length toggles, and C3 one-tap send from the leads list as a deliberate boundary test.

The two studies were designed to triangulate. Customer sessions stress-tested the behavioral content the tool needs to produce. Pro sessions stress-tested the workflow and control around how it gets produced and sent. The customer study is a behavioral-validation study of the archetypes and the RQI model. The pro study is a divergent-prototype study intended to surface behavioral limits — not to converge on a preferred UI.

Customer study

[Diagram: 1 customer interacts with 7 pro archetypes (P1–P7); 7 archetypes, ranked by participants]

Pro study

[Diagram: 1 pro tests 3 concepts: C1 plain draft, C2 toggles (preferred), C3 one-tap send; 3 concepts, workflow + control]

Directional findings. Core patterns replicated across studies. Not a causal claim about conversion — the MBT will test whether the draft moves customer reply rate.

What customers do when seven pros message them

What replicated from RQI

  • Effort-vs-confidence dynamic holds. When effort rises faster than confidence, customers defer. The prototype sessions let us watch this happen in real time: customers engaged warmly with Pro 3's thorough questions or dropped off sharply from Pro 5's high-effort intake bundle (photos + measurements + availability + budget up front).
  • Acknowledgment + one bounded next step continues to outperform. Pro 7 (mirroring + one question + credibility signal) and Pro 6 (acknowledgment + one clarifying question) were the consistent winners.
  • Early channel switching creates friction — but is not universally negative. Bryson read Pro 4's phone number offer as disintermediation and pushiness; Tanya (managing 23 properties) and Arpit read the same move as efficient and welcome. Timing and customer context moderate.

New or sharpened from customer concept testing

The effort-vs-confidence tradeoff is bimodal, not a curve.

Detail-oriented customers (Xuan, Melody, Barbara) treated Pro 3's thorough upfront questions as a competence signal — "any pro that's going to ask about all the details first strikes me as trustworthy." Others (NaQuia, Tanya) found the same pattern overwhelming and preferred Pro 7's middle path of lightweight mirroring plus staged questions. The implication is not "personalize to segment" (we can't reliably segment on this) but rather design the default to favor Pro 7's pattern, which was robust across both groups, and let detail-oriented customers self-select through follow-up.

[Chart: confidence vs. upfront detail level (low, medium, high). Detail-oriented customers rise with detail; efficiency-focused customers fall. Pro 7 sits in the middle, robust across both.]

Effort-vs-confidence is bimodal. Pro 7's middle-ground pattern was robust across both customer types.

Listening is the strongest trust signal and the easiest to break.

Bryson stated Saturday after 11am as his availability; Pro 1 offered Friday 2pm and Monday 10am. That single miss was unrecoverable in his session — not a trust discount, a disqualification. Across sessions, ignoring a stated detail (time, location, a direct question) was the single most consistent path to being dropped.

This finding has a direct hook into the prompt work. Two specific data inputs currently missing or underweighted would address it: (1) the specific date/time selections the customer made — not just the binary "did they select availability." Chelsia flagged that V6 outputs are still generating "when works for you?" questions in threads where the customer has already provided windows. (2) The multimodal query data — customer search query, photos, and captions — where customers encode signal that doesn't appear in structured fields. Chelsia pointed to a concrete failure pattern already visible in production: pros asking "what exactly did you want me to install" when the answer was in a photo the model can't see. That's the Bryson failure mode, at scale, in live threads.

"Just listen to me. If I'm telling you Saturday after 11, why are you telling me Friday and Monday?"

— Bryson, customer session
Customer wrote

…available Saturday after 11am. Looking for someone who can come same-day if possible.

Pro 1 responded

I can schedule you for Friday 2pm or Monday 10am. Let me know what works!

Bryson: disqualified.

Competitive shortlisting collapses the recovery window.

Once NaQuia and Arpit had identified 2–3 viable pros, they explicitly said they would not respond to the rest — even if those pros sent perfectly fine follow-ups. This is new information relative to the RQI framing, which positioned follow-up as a recovery mechanism. It is a recovery mechanism, but only for pros who made the shortlist on the first message. For pros who didn't, there's no recovery.

[Diagram: all 7 pros respond (P1–P7); customer shortlists 2–3; no recovery path for the rest.]

"Once I had 3 viable options… I probably would not respond to the others."

— NaQuia, customer session

Bot detection is already happening.

Jay wrote intentionally vague prompts because he detected LLM generation and wanted to see how each pro handled ambiguity. Customers flagged AI-sounding language through: perfect grammar, instantaneous response times, em-dashes, uniform structure across pros, and — most importantly — verbatim echoing of request-form details. Grace Boatwright's feedback on the v3 prompt output independently flagged the same signals.

If customers can detect AI in a research session, they can detect it in production. Anti-redundancy becomes the difference between "Thumbtack helped me respond" and "Thumbtack is sending form letters on my behalf."

Pricing upfront is a significant trust builder — even when pros resist it.

Roslyn's session: Pro 6's price mention was the moment she moved that pro to top-of-list. Pros in the parallel study generally avoid pricing upfront because services are custom. There's a real tension here that v1 won't resolve, but worth flagging: the customer-side appetite for price clarity is stronger than most pro workflows currently accommodate.

Seven archetypes, three outcomes

Archetype | Pattern | Outcome
Pro 1 — No acknowledgment | Friendly but generic | Generally dropped. Disqualifying when stated details ignored.
Pro 2 — No clear next step | Acknowledges but stalls | Near-universal hard failure. Cognitive load transfers to customer.
Pro 3 — Full intake | High-effort upfront | Split. Trust signal for detail-oriented; overwhelming for others.
Pro 4 — Call-first | Offers phone early | Split. Works for efficiency-focused; pushy for others.
Pro 5 — Question-ignoring | Redirects past asked questions | Disqualifying. Closest analog to Pro 1.
Pro 6 — Acknowledgment + one question | Mirrors lightly, asks one thing | Strong and consistent.
Pro 7 — Mirroring + question + credibility | Best-fit across both groups | Most robust across all 9 sessions.

"There's nothing to respond to… the cognitive load is on me to carry on."

— Xuan, on Pro 2

What pros do when we hand them an AI draft

The C2 preference is about trust, not performance

Six of seven pros preferred C2 (structured draft with toggles) over C1 (plain draft). The obvious read is "pros want more control." The more precise read — which matters for product strategy — is that the toggles functioned as a trust mechanism, not a performance mechanism. Pros didn't describe the toggles changing the content of what they sent in meaningful ways. They described them as making them feel safer sending what was already there.

This matters because C2 can't ship in v1. The implication is not "build C2 anyway" — it's that the plain draft (C1) has to clear a higher bar on accuracy and voice than it would if pros had an in-context safety valve. If pros can't tweak quickly, they need to trust out of the box — which means output quality, voice detection, and verification affordances all get more weight.

An accuracy-verification affordance may substitute for control at lower cost. Grace's observed behavior (toggling back to verify against the lead details before sending) and Sarthak's proposal — a button that surfaces the original request details right in the compose step — map directly to the same underlying anxiety: "is this draft accurate to what this customer said?"

C1 — Plain draft
Draft response

Hi Sarah, thanks for reaching out about your bathroom remodel. I've helped many homeowners in the Westside area with similar projects and would love to schedule a time to discuss the specifics. Looking forward to hearing from you!

Send
Is this accurate to what she said?
Does it sound like me?
Same content. Different feeling. Different mechanism.
C2 — Structured draft with toggles
Draft response

Hi Sarah, thanks for reaching out about your bathroom remodel. I've helped many homeowners in the Westside area with similar projects and would love to schedule a time to discuss the specifics. Looking forward to hearing from you!

Tone: Warm · Professional · Direct
Length: Short · Medium
Send
Control signal
Safety valve

"It's a worthless tool if I can't have some input as to what it's generating."

— Michael, pro session

C3 was a unanimous rejection — with two overlapping reasons

Every pro said they'd always view lead details before sending, but for two distinct reasons, and both matter:

  1. Category-driven rejection. Don (caricature artist), Monica (catering), and Emmanuel (commercial flooring) rejected C3 because their businesses sell specific dates or spaces — they can't commit without verifying availability.
  2. Universal rejection. Even pros without that constraint (Mark, Dan, Grace) said they'd always check the lead first.

This validates keeping the compose flow inside the lead details view rather than at the leads-list level — already the PRD direction. Worth flagging: C3 surfaced a secondary insight — pros value the view-details step as a decision-quality moment, not just an information-gathering one.

Reply from leads list
Leads
Sarah M. · Bathroom remodel · Send ↗
James K. · Deck installation · Send ↗

Too fast. No context.

Reply from lead details
Sarah M. — Bathroom remodel
Project: Master bath renovation
Timeline: Starting next month
Budget: $8,000–12,000
Draft response ↗

Context first.

"It's a trap to reply too fast."

— Grace, pro session

Voice, differentiation, and "AI-ed" language

Mark immediately read the v1 draft as "AI-ed." Don flagged em-dashes as an AI tell. Dan (Minnesota Headshots): "No one would ever write this stuff… that immediately takes confidence away." The concern cuts two ways:

  • Pro-side: if pros perceive the draft as AI-ish, they'll edit heavily (cost: adoption friction) or disable the feature entirely (cost: reach).
  • Customer-side: pros also worry that if every pro sends similar AI-drafted responses, customers will notice and trust the platform less — not just any individual pro. Dan flagged this as a platform-level risk.

This maps directly onto Grace Boatwright's v3 prompt feedback (excessive mirroring, overly formal language, uniformity across pros). The two data streams are pointing at the same thing from opposite ends — customers detect uniformity, pros detect uniformity, and both interpret it as a quality/authenticity failure.

"If all my competitors are clicking the same button, there is no differentiation."

— Mark, pro session

Next-step defaults and the tension with RQI priors

Pros have strong opinions about what the next step should be, and they almost all default to phone call. The RQI investigation positioned call-first escalation as a failure pattern when it outpaces confidence. The pro sessions show that call-first isn't just a habit — it's a rational risk-management move in pro workflows (qualify lead, reduce wasted spend, move off-platform to control pacing). The customer sessions show that call-first works for some customers and breaks for others, with timing as the moderator.

v1 descopes pro preferences for next-step type. The research suggests this is the right call for the initial MBT — we need to learn what works for customers before opening up pro customization — but we should plan to surface next-step control relatively quickly as a fast-follow, and communicate to pros that the tool is defaulting to what the data says works, not overriding their judgment.

What changes for the plan

The moves that follow from both studies together. These are the changes to v1 scope and priority that the research supports.

1. The plain draft needs to nail four things to be defensible in v1

Because we're shipping without toggle-based control, the default output has to earn trust without a safety valve.

  1. Lightweight mirroring
     "Saw you're looking for a 60-gallon electric water heater replacement — that's a project I can likely turn around same-week."
     1–3 high-signal details from free-text, not a form transcript. Varied across pros.

  2. Next step as a question
     "Does Thursday afternoon work for a 15-min call to confirm the unit size?"
     Low cognitive cost to reply. Ball in customer's court. Not: "I can schedule Thursday."

  3. No boilerplate credibility
     Avoid: "As a highly experienced plumber with top reviews…"
     They've already seen your reviews. Credibility comes from asking the right question.

  4. Voice variation across pros
     Vary length, which details get mirrored, and sentence structure across pros responding to the same lead.
     Cheapest anti-redundancy move. Doesn't require rewriting voice per pro.

2. Anti-redundancy should be P0, not P1

  • Customers compare messages in parallel and will detect uniform output.
  • Pros won't adopt a tool that flattens their differentiation.
  • Cheapest intervention is varying which details are mirrored and the structural shape of each response, not rewriting voice per pro (expensive).
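The cheap intervention above can be made concrete. A minimal sketch, assuming a hypothetical list of details extracted from the customer's free-text and illustrative pro IDs (none of these names come from the PRD): hash each (pro, detail) pair so that pros responding to the same lead deterministically mirror different subsets, with no shared state between generation calls.

```python
import hashlib

def pick_mirrored_details(details, pro_id, k=2):
    """Deterministically pick which k customer details this pro's draft
    mirrors, so pros responding to the same lead echo different things."""
    # Sort details by a hash of (pro_id, detail): stable for a given pro,
    # but the ordering differs across pros -- no randomness or coordination.
    scored = sorted(
        details,
        key=lambda d: hashlib.sha256(f"{pro_id}:{d}".encode()).hexdigest(),
    )
    return scored[:k]

details = ["60-gallon electric unit", "same-day preferred", "Saturday after 11am"]
print(pick_mirrored_details(details, pro_id="pro_123"))
print(pick_mirrored_details(details, pro_id="pro_456"))
```

The same idea extends to varying structural shape (greeting style, sentence count) by hashing into a small set of templates, which keeps variation cheap without per-pro voice modeling.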

3. Verification affordance as a lower-cost substitute for toggles

Sarthak's proposal — a button on the compose step that surfaces the original request details — maps directly to Grace's observed verification behavior (toggling back to check accuracy before sending). The underlying anxiety is the same: "is this draft accurate to what this customer said?"

Recommend scoping for v1 if timeline permits; at minimum, v1.1 fast-follow. This is a lower-cost trust mechanism than building the full C2 toggle UI, and it addresses the same root cause.

4. The pro-side communications layer matters more than it looks

This connects directly to Cailee's GTM work. Key moves:

  • Chelsia's original framing of the NUX goal (influence pro behavior and understanding, not blind adoption) aligns with what pros told us.
  • Pros who don't understand why the tool is making a particular choice will override toward their own default (usually a call).
  • The 4/21 legal review removed the heavy AI-disclosure requirement, which loosens the legal case for a NUX but strengthens the behavioral one.

Lower-confidence position

Session evidence supports awareness before first exposure over reinforcement at point of use: Grace, Michael, Monica, and Mark all expressed wanting advance notice about what the tool does, not a pop-up at the compose moment. The distinction is between preparedness (can be done via GTM) vs. just-in-time disclosure (comes too late and crowds the trust-sensitive compose moment). Lean toward consolidating awareness into Cailee's GTM layer rather than a compose-level NUX. If an in-product surface is still valuable, scope it upfunnel.

Email / Push
Awareness territory
Jobs tab / Lead detail
Awareness territory
Compose moment
Trust-sensitive — don't crowd

Top three behavioral rationales for the GTM layer

  1. Listening matters most — the tool is built around not asking for things the customer already gave you.
  2. One bounded next step as a question — customers reply when it's easy; they stall when the response dead-ends.
  3. Differentiation is intentional — the tool is designed to make each pro's response distinct, not interchangeable.

5. Separate signal restoration from intervention design — visibility still matters

The RQI investigation's hard constraint — ~25–30% of pro responses are never viewed — still holds and is not addressable via response quality. The IRQ work should not be measured against outcomes in threads where the customer never saw the response. Primary success metric (% projects with customer reply within 24hr of first pro response) correctly gates on the response existing and being seen; worth confirming the measurement excludes unseen messages cleanly.
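The gating logic is worth pinning down. A hedged sketch, with hypothetical field names standing in for the real event schema: unseen threads are excluded from the denominator, not counted as failures.

```python
from dataclasses import dataclass

@dataclass
class Thread:
    first_response_seen: bool   # customer actually viewed the first pro response
    customer_replied_24h: bool  # customer replied within 24h of that response

def primary_metric(threads):
    """% of projects with a customer reply within 24h, gated on the first
    response having been seen. Threads where the response was never viewed
    (~25-30% per RQI) are excluded from the denominator entirely."""
    seen = [t for t in threads if t.first_response_seen]
    if not seen:
        return None
    return sum(t.customer_replied_24h for t in seen) / len(seen)

threads = [
    Thread(True, True),
    Thread(True, False),
    Thread(False, False),  # never viewed: excluded, not a failure
]
print(primary_metric(threads))  # 0.5, not 1/3
```

The design choice this encodes: response quality can only be blamed for outcomes in threads where the response was seen, which is exactly the separation the section argues for.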

Two prompt inputs that would close the listening gap

Chelsia flagged that V6 outputs are still asking customers for windows they already provided. Grace flagged that multimodal query data isn't in context yet. These are the two concrete, testable hooks the research points to.

  1. Specific date/time selections — not just the binary "did the customer provide availability," but which windows. The Bryson failure mode, productionized.
  2. Multimodal query data — search query, photo presence, captions. Where customers encode signal that doesn't make it into structured fields.
Current prompt context: request form fields, pro past messages, thread history.
Missing: specific date/time windows, multimodal query data.

April 29 prompt finalization deadline. Both inputs are in scope if added now.
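As a sketch of what adding the two inputs looks like at context-assembly time. All field names here are illustrative, not the production schema; the point is the shape of the two additions.

```python
def build_prompt_context(request):
    """Assemble draft-generation context, including the two inputs the
    research flags as missing. Field names are hypothetical."""
    parts = [
        f"Request form: {request['form_fields']}",
        f"Customer free-text: {request['free_text']}",
    ]
    # (1) Specific windows, not just "availability provided: yes/no" --
    # prevents drafts asking "when works for you?" after windows were given.
    if request.get("availability_windows"):
        parts.append("Stated availability: " + "; ".join(request["availability_windows"]))
    # (2) Multimodal query data: search query and photo captions, where
    # customers encode signal absent from structured fields.
    if request.get("search_query"):
        parts.append(f"Search query: {request['search_query']}")
    for cap in request.get("photo_captions", []):
        parts.append(f"Photo caption: {cap}")
    return "\n".join(parts)

ctx = build_prompt_context({
    "form_fields": {"service": "water heater replacement"},
    "free_text": "available Saturday after 11am, same-day if possible",
    "availability_windows": ["Saturday after 11am"],
    "search_query": "60 gallon electric water heater install",
    "photo_captions": ["current unit in garage closet"],
})
print(ctx)
```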

What this doesn't settle

Open tensions

These are tensions the research surfaced but doesn't resolve. They need product/strategy judgment, not more research (for now).

Trust mechanism vs. performance mechanism

Do we invest in toggles post-v1 primarily to drive adoption (trust) or because they measurably improve outputs (performance)? The research suggests the former. If that's right, the investment case is different — it's an adoption lever, not a content lever.

Call-first default vs. data-driven default

If the data shows that asking a follow-up question gets more replies than suggesting a call, should the draft default to the follow-up question even when pros would have chosen a call? Becomes a communication and education question, not just a product one.

Differentiation vs. quality floor

If we vary outputs enough to avoid uniformity, some variations will be worse than others. The customer research suggests variance has real value — customers used the differences to evaluate fit — but it's a trade we haven't explicitly made.

Pricing transparency vs. pro resistance

Customer-side pull is real; pro-side resistance is real. v1 doesn't need to resolve it; post-MBT does.

What's out of scope for this synthesis

To be explicit about what this doc isn't claiming.

  • Not a pro readout on its own. The pro study is still being synthesized separately with more depth on workflow, opt-out, and education.
  • Not a causal claim about conversion. Both studies are behavioral validation, not A/B tests. The MBT will test whether the draft actually moves customer reply rate.
  • Not a scale-readiness assessment. v1 is unlikely to be scale-ready in its current form and a v2 iteration is likely. This doc informs what v1 needs to clear to be a useful learning test.
  • Not an endorsement of Pro 7's specific language. The win was about pattern (lightweight mirroring + question + signal), not wording. The prompt work should aim at the pattern, not the text.
  • Not a complete picture of the visibility layer. RQI established unseen messages as a structural gate; this work builds on that but doesn't address it.

What happens next

Recommended next steps

Immediate — before April 29 prompt finalization

  1. Revise prompt to weight customer free-text message above request form fields.
  2. Add specific date/time selections to the prompt context.
  3. Add multimodal query data (customer search query, photo presence + captions).
  4. Implement lightweight mirroring constraint: 1–3 details max, varied across pros.
  5. Reframe bounded next step as a question in the prompt.
  6. Deprioritize generic credibility language; test prompt versions with and without.
  7. Scope the compose-step "show lead details" verification button for v1 or earliest fast-follow.

Eval sample composition

  1. Ensure the eval set includes a mix of booking archetypes and job natures, a subset of pros with no past conversation history, and multiple pros per request_pk (anti-redundancy evaluation).
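The multiple-pros-per-request_pk item implies a concrete check. A hedged sketch using token-level Jaccard similarity as a crude stand-in for whatever similarity metric the eval actually adopts: flag pairs of drafts for the same request that are near-identical, the uniformity customers read as AI.

```python
from itertools import combinations

def jaccard(a, b):
    """Token-level Jaccard similarity -- a stand-in metric, not a proposal."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def redundancy_flags(drafts_by_request, threshold=0.6):
    """For each request_pk with multiple pro drafts, flag near-identical
    pairs. Threshold is illustrative and would need tuning."""
    flags = []
    for request_pk, drafts in drafts_by_request.items():
        for (pro_a, da), (pro_b, db) in combinations(drafts.items(), 2):
            sim = jaccard(da, db)
            if sim >= threshold:
                flags.append((request_pk, pro_a, pro_b, round(sim, 2)))
    return flags

drafts = {
    "req_1": {
        "pro_a": "Hi Sarah, thanks for reaching out about your bathroom remodel.",
        "pro_b": "Hi Sarah, thanks for reaching out about your bathroom remodel!",
        "pro_c": "Saw the master bath photos. Does Thursday work for a quick call?",
    }
}
print(redundancy_flags(drafts))  # flags only the pro_a / pro_b pair
```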

Pre-MBT — pro-facing communications (this week / near-term)

  1. Cailee's fast-turnaround GTM content (1–2 pre-launch emails + pushes) should lead with the top three behavioral rationales — listening, bounded next step as question, differentiation — framed as general response-quality coaching pros can apply with or without the tool.
  2. Lean toward consolidating awareness into the GTM layer rather than a compose-level NUX. If an in-product surface is still valuable, scope it upfunnel.
  3. Design pro opt-out path.

Post-MBT

  1. Next-step control as v1.1 fast-follow.
  2. Voice training on pro message history — plan the roadmap explicitly.
  3. Pricing transparency as a separate workstream.

Measurement

  1. Primary metric should explicitly gate on response visibility, not just response sent.
  2. Add guardrail on pro "disable feature" rate as an adoption signal.

Research's own next steps

  1. Refresh the RQI IRR analysis on the original tags, values, and decision rules.

Appendix

Archetype performance across customer sessions (directional)

Archetype | Pattern | Net reception
Pro 1 — No acknowledgment | Friendly but generic; no mirroring of stated details | Generally dropped by most participants. Becomes disqualifying when stated details (time, location) are ignored. Lower harm ceiling than Pro 2 or Pro 5 because at least it doesn't actively contradict.
Pro 2 — No clear next step | Acknowledges request but ends without a clear action | Near-universal hard failure. Cognitive load transfers to the customer to figure out the next move. Xuan's "there's nothing to respond to" quote captures this exactly.
Pro 3 — Full intake | High-effort upfront: multiple detailed questions in first message | Bimodal. Detail-oriented participants (Xuan, Melody, Barbara) found it thorough and trustworthy. Efficiency-focused participants (NaQuia, Tanya) found it overwhelming. Not universally safe.
Pro 4 — Call-first | Offers phone number or call invite in first message | Bimodal. Efficiency-focused customers read it as professional and time-saving. Others read it as pushy or disintermediating. Timing and customer context are the key moderators.
Pro 5 — Question-ignoring | Redirects away from a question the customer directly asked | Disqualifying. Closely analogous to Pro 1's failure mode but more active — it doesn't just fail to acknowledge, it demonstrably ignores. Hardest failure outside of Pro 2.
Pro 6 — Acknowledgment + one question | Mirrors 1–2 specific details, asks one bounded clarifying question | Strong and consistent. Worked across most participant types. No disqualifying moments. Low effort to respond. Robust second-best.
Pro 7 — Mirroring + question + credibility signal | Lightweight mirroring, one question, implicit credibility via question quality | Most robust across all 9 sessions. The only archetype that consistently worked across both detail-oriented and efficiency-focused customers. Credibility signal landed through pattern, not through self-promotional language.

Verbatims referenced throughout

"Once I had 3 viable options… I probably would not respond to the others."

NaQuia — customer session

"Just listen to me. If I'm telling you Saturday after 11, why are you telling me Friday and Monday?"

Bryson — customer session

"Any pro that's going to ask about all the details first strikes me as trustworthy."

Xuan — customer session

"There's nothing to respond to… the cognitive load is on me to carry on."

Xuan — customer session, on Pro 2

"It's a worthless tool if I can't have some input as to what it's generating."

Michael — pro session

"You're giving me the control."

Turk — pro session

"It's a trap to reply too fast."

Grace — pro session

"If all my competitors are clicking the same button, there is no differentiation."

Mark — pro session

"No one would ever write this stuff… that immediately takes confidence away."

Dan — pro session

"I like to be prepared for changes that are coming."

Grace — pro session, on advance notice

"If the data showed that, that would change how I use the tool."

Jen — pro session, on call-first default vs. data-driven default