MIT NANDA research, as reported by Fortune, found that 95% of enterprise generative-AI pilots deliver no measurable P&L impact. That statistic should reframe how you choose an AI partner. The gap between a demo and a result is rarely the model: modern models are remarkably capable out of the box. The gap is integration into real workflows, clear ownership, honest measurement, and the unglamorous engineering that keeps a system reliable once real users hit it. Asking the right questions up front tells you which side of that 95% a partner will land you on, because the questions force the conversation away from the demo and onto the parts that actually decide outcomes.
A note on how to use these: don't just collect answers; listen for specificity. A strong partner answers with concrete examples and trade-offs. A weak one answers with adjectives.
On capability and fit
- 1. What have you shipped to production, not demoed? Ask for live AI systems with real users, real failure modes, and real uptime. A demo proves the happy path works once; production proves the team can handle the other 20% of cases that break everything. (Here's ours.)
- 2. Is this built on your own architecture, or a thin wrapper? Understand what's "under the hood": proprietary, open-source, or a reseller of someone's API. A wrapper isn't automatically bad, but you should know what you're paying a margin on and how much lock-in comes with it.
- 3. How will you measure success in business terms? If the answer is "accuracy" rather than a P&L line (cost saved, revenue added, hours returned), push harder. Vanity metrics are how pilots end up in the 95%. (Why ROI is the real skill.)
On data, security and IP
- 4. Who owns the model, the prompts and the outputs? It should be you, in writing. Prompts and fine-tuned weights are real IP; don't let them sit in a grey area.
- 5. What happens to our data, during and after? Contracts should spell out exactly what the vendor can and can't do with your data, including whether it's used to train anything and what happens when the engagement ends.
- 6. How do you handle security and compliance? Ask about encryption in transit and at rest, access controls, and your industry's regime (HIPAA, SOC 2, GDPR). For AI specifically, ask how they handle prompt injection and data leakage through the model itself.
- 7. How do you prevent hallucinations and handle errors in production? Listen for evaluation harnesses, guardrails, retrieval grounding and human-in-the-loop where the stakes are high. "The model is very accurate" is not an answer.
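To make "evaluation harness" concrete when you listen to a vendor's answer on hallucinations, here is a minimal sketch of the idea in Python. Everything in it is an illustrative assumption rather than any vendor's method: the `EvalCase` structure, the keyword-matching check, the `stub_model` stand-in, and the 95% release threshold are all hypothetical. A real harness would use a richer grading method and a far larger case set.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    question: str
    expected_keywords: List[str]  # facts a correct answer must mention (hypothetical check)


def answer_passes(answer: str, case: EvalCase) -> bool:
    """Pass only if every expected keyword appears in the answer, ignoring case."""
    text = answer.lower()
    return all(keyword.lower() in text for keyword in case.expected_keywords)


def run_eval(model: Callable[[str], str], cases: List[EvalCase]) -> Tuple[float, List[EvalCase]]:
    """Run every case through the model; return the pass rate and the failing cases."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    failures = [case for case in cases if not answer_passes(model(case.question), case)]
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures


def release_gate(pass_rate: float, threshold: float = 0.95) -> str:
    """Below the (assumed) threshold, send answers to a person instead of shipping them."""
    return "ship" if pass_rate >= threshold else "route to human review"


def stub_model(question: str) -> str:
    """Stand-in for a real model call, so the sketch runs without any API."""
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days of purchase.",
        "Do you ship to Canada?": "Yes, we ship worldwide.",
    }
    return canned.get(question, "I am not sure.")


CASES = [
    EvalCase("What is the refund window?", ["30 days"]),
    EvalCase("Do you ship to Canada?", ["Canada"]),
]

pass_rate, failures = run_eval(stub_model, CASES)
print(pass_rate, [case.question for case in failures], release_gate(pass_rate))
```

The point of the sketch is the shape, not the grading rule: a fixed set of cases, a pass rate tracked over time, and an explicit fallback to a human when the rate drops. A partner with a real answer should be able to describe their own version of each piece.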
On cost and commitment
- 8. What's the full running cost, including the maintenance tail? Inference, monitoring, prompt upkeep and re-evaluation as models change are all ongoing. AI systems have a steeper running-cost curve than traditional software, and a quote that ignores it is incomplete. (The AI cost curve.)
- 9. Can we start with a paid pilot? A small proof-of-concept is the best "try before you buy at scale" there is, and for AI, where outcomes are genuinely uncertain, it's close to mandatory.
- 10. Who specifically works on our account, and how senior are they? AI work rewards judgement; you want named senior people, not a generic "team."
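The running-cost question is easier to press on with the arithmetic in front of you. The sketch below adds up the ongoing items named above (inference, monitoring, and the hours spent on prompt upkeep and re-evaluation) and shows how they compare with a one-off build quote. Every figure is a placeholder chosen for easy arithmetic, not a benchmark, and the `RunningCost` fields are hypothetical names; substitute the numbers from the vendor's actual quote.

```python
from dataclasses import dataclass


@dataclass
class RunningCost:
    requests_per_month: int
    cost_per_1k_requests: float    # inference cost per 1,000 requests
    monitoring_per_month: float    # logging, alerting, dashboards
    upkeep_hours_per_month: float  # prompt upkeep and re-evaluation as models change
    hourly_rate: float

    def monthly_total(self) -> float:
        """Sum the ongoing monthly costs."""
        inference = self.requests_per_month / 1000 * self.cost_per_1k_requests
        upkeep = self.upkeep_hours_per_month * self.hourly_rate
        return inference + self.monitoring_per_month + upkeep

    def total_over(self, months: int, build_cost: float) -> float:
        """Build cost plus the maintenance tail over the given number of months."""
        return build_cost + months * self.monthly_total()


# Placeholder figures only: replace with the numbers in the quote you are reviewing.
quote = RunningCost(
    requests_per_month=50_000,
    cost_per_1k_requests=10.0,
    monitoring_per_month=200.0,
    upkeep_hours_per_month=10,
    hourly_rate=100.0,
)

print(quote.monthly_total())                     # 1700.0
print(quote.total_over(12, build_cost=20_000.0)) # 40400.0
```

With these placeholder inputs the first-year running cost roughly equals the build cost, which is the kind of comparison a quote that lists only the build price hides.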
On the long game
- 11. What happens if we want to bring this in-house later? A confident partner documents the system and makes the handover easy. Evasiveness here is a lock-in signal.
- 12. Show me a project that went wrong. What did you do? How a team handles failure tells you more than any case study. Every honest AI shop has a story; be wary of the one that claims it doesn't.
The red flags to watch for
- An impressive demo but no production reference you can actually contact.
- Success defined in model metrics rather than a business outcome.
- Vagueness about data usage, IP ownership, or what happens at the end of the contract.
- Pressure to commit to a full build before any pilot has proven value.
A short scenario
Imagine two vendors pitching the same support-automation project. Both demos look great. Asked question 3, Vendor A says the bot is "92% accurate." Vendor B says they'd target a 30% reduction in tickets reaching a human while holding customer-satisfaction scores flat, and they'd instrument both numbers from day one. Asked question 12, Vendor A claims it has never had a project go wrong; Vendor B describes a deployment where the model confidently gave wrong answers, how they caught it with evaluation, and the human-in-the-loop fallback they added. On the demo alone the two are indistinguishable. On the answers, one is clearly the partner who keeps you out of the 95%, and the difference only surfaces because the questions forced it.
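Vendor B's target is specific enough to check mechanically, which is part of what makes it a good answer. As a rough sketch, assuming you can count total tickets and tickets escalated to a human in a baseline period and a pilot period, the check might look like the following. The function names, the satisfaction scale, and the tolerance used to define "flat" are all assumptions for illustration; agree the real definitions with the vendor before the pilot starts.

```python
def escalation_rate(escalated: int, total: int) -> float:
    """Share of tickets that reached a human."""
    if total <= 0:
        raise ValueError("total tickets must be positive")
    return escalated / total


def pilot_met_target(
    baseline_escalated: int,
    baseline_total: int,
    pilot_escalated: int,
    pilot_total: int,
    baseline_csat: float,
    pilot_csat: float,
    target_reduction: float = 0.30,  # the 30% reduction from the scenario
    csat_tolerance: float = 0.05,    # assumed definition of "flat"
) -> bool:
    """True only if escalations fell by the target AND satisfaction held."""
    base_rate = escalation_rate(baseline_escalated, baseline_total)
    pilot_rate = escalation_rate(pilot_escalated, pilot_total)
    if base_rate == 0:
        raise ValueError("baseline has no escalations to reduce")
    reduction = 1 - pilot_rate / base_rate
    csat_held = pilot_csat >= baseline_csat - csat_tolerance
    return reduction >= target_reduction and csat_held


# Illustrative numbers only.
hit = pilot_met_target(1000, 2000, 680, 2000, baseline_csat=4.4, pilot_csat=4.4)
missed_reduction = pilot_met_target(1000, 2000, 750, 2000, baseline_csat=4.4, pilot_csat=4.4)
missed_csat = pilot_met_target(1000, 2000, 650, 2000, baseline_csat=4.4, pilot_csat=4.0)
print(hit, missed_reduction, missed_csat)  # True False False
```

Using rates rather than raw counts keeps the comparison fair if ticket volume differs between the two periods, and requiring both conditions prevents a vendor from hitting the reduction by making the bot harder to escape.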
What to actually do
Send these twelve questions to every vendor on your shortlist and compare the answers side by side; the comparison itself is more revealing than any single answer. Then run a small paid pilot scoped to one real workflow with a measurable target — a number you'd be glad to hit. Decide on what the pilot proves, not on the polish of the proposal. This is the single most reliable way to stay out of the 95%, and it costs far less than discovering the gap after a full build. (Why so many pilots stall before value is the larger story here.)
A partner who answers all twelve crisply is rare — and worth far more than the one with the slickest demo.
If you're scoping an AI build, bring us these questions — we'd rather you ask them of everyone you're considering, including us.
Sources
- Fortune / MIT NANDA — 95% of GenAI pilots show no measurable return
- Netguru — How to evaluate AI vendors: a guide for CTOs