Published. Validated. Auditable.
A conversational AI assessment built on behavioural science, validated by independent research, and designed for predictive validity — not multiple choice.
Why conversational assessment
Multiple-choice AI tests have three fundamental validity problems. AISA solves all of them architecturally.
AI can answer its own questions
Conversational evidence can’t be looked up
Recognition ≠ proficiency
We score demonstrated behaviour, not recall
No evidence trail
Every score is tied to a verbatim quote
Full analysis: How conversational evidence prevents cheating
Independently validated
In 2026, Anthropic published the AI Fluency Index — the largest empirical study of AI fluency to date, analysing 9,830 conversations. AISA's rubric was developed independently.
93% of Anthropic's observable fluency markers covered by AISA
9,830 conversations analysed in Anthropic's study
4 additional dimensions AISA measures that Anthropic couldn't
The four dimensions Anthropic couldn't measure through chat logs alone — AI Fundamentals, Tool Landscape, Domain Application, and Safety — require a structured conversational assessment to surface. AISA's dual-track architecture makes this possible.
The published rubric
5 dimensions, 11 criteria, published and auditable. Every score has a behavioural anchor — you can see exactly what a 5 looks like versus a 7 versus a 9.
Dimension weights
Score scale (1–10)
5 dimensions · 11 criteria
- P1 — Prompt Design
- P2 — Iterative Dialogue
- P3 — Context & Memory Management
- T1 — Output Evaluation
- T2 — Limitation Awareness
- U1 — AI Fundamentals
- U2 — Tool Landscape
- W1 — Workflow Integration
- W2 — Task Decomposition
- W3 — Domain Application
- S1 — AI Safety & Responsibility
Full rubric with behavioural anchors: The AISA Rubric — 5 Dimensions of AI Proficiency
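As a rough illustration, the criteria list can be held as a small lookup table and grouped by the letter prefix of each ID. Treating the prefix as the dimension key is an inference from the IDs, and `CRITERIA` and `group_by_dimension` are hypothetical names, not part of AISA's published material.

```python
from collections import defaultdict

# The 11 criteria exactly as listed above; the dict itself is an illustrative structure.
CRITERIA = {
    "P1": "Prompt Design",
    "P2": "Iterative Dialogue",
    "P3": "Context & Memory Management",
    "T1": "Output Evaluation",
    "T2": "Limitation Awareness",
    "U1": "AI Fundamentals",
    "U2": "Tool Landscape",
    "W1": "Workflow Integration",
    "W2": "Task Decomposition",
    "W3": "Domain Application",
    "S1": "AI Safety & Responsibility",
}


def group_by_dimension(criteria: dict[str, str]) -> dict[str, list[str]]:
    """Group criterion IDs by their leading letter (assumed to mark the dimension)."""
    groups: dict[str, list[str]] = defaultdict(list)
    for criterion_id in criteria:
        groups[criterion_id[0]].append(criterion_id)
    return dict(groups)
```

Grouping this way yields five keys covering eleven criteria, which matches the "5 dimensions · 11 criteria" summary.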
How scoring works
Three layers of scoring, each adding reliability. No single model has the final say.
Evidence extraction
Track B scores every candidate message against the rubric, extracting verbatim quotes with confidence levels. Evidence is classified: demonstrated > described > managed.
Confidence-weighted aggregation
Multiple evidence pieces per criterion are blended by confidence (high/medium/low). Peak performance is weighted alongside sustained performance — a single brilliant answer counts, but consistency matters more.
Holistic calibration pass
A more capable model (Claude Opus) reviews the full transcript and adjusts scores that the per-turn evaluator got wrong. It must provide disconfirming evidence for every adjustment.
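A minimal sketch of the second layer, assuming one plausible way to blend evidence by confidence. The numeric confidence weights, the `peak_share` parameter, and the `Evidence` and `aggregate` names are all assumptions made for illustration; the text above names the confidence levels and the peak-versus-sustained idea but not the formula.

```python
from dataclasses import dataclass

# Assumed numeric weights; the source only names the levels high/medium/low.
CONFIDENCE_WEIGHT = {"high": 1.0, "medium": 0.6, "low": 0.3}


@dataclass
class Evidence:
    criterion: str   # rubric criterion ID, e.g. "P1"
    quote: str       # verbatim candidate quote backing the score
    score: float     # 1-10 on the rubric scale
    confidence: str  # "high", "medium", or "low"


def aggregate(evidence: list[Evidence], peak_share: float = 0.3) -> float:
    """Blend evidence for one criterion.

    The confidence-weighted mean stands in for sustained performance and the
    maximum for peak performance. peak_share is kept below 0.5 so that
    consistency outweighs a single strong answer.
    """
    if not evidence:
        raise ValueError("no evidence for this criterion")
    weights = [CONFIDENCE_WEIGHT[e.confidence] for e in evidence]
    sustained = sum(w * e.score for w, e in zip(weights, evidence)) / sum(weights)
    peak = max(e.score for e in evidence)
    return round((1 - peak_share) * sustained + peak_share * peak, 2)
```

Under these assumed weights, a low-confidence weak answer pulls the criterion score down less than a high-confidence one would, which is the behaviour the description implies.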
See expert scoring in practice: What a Score 9 looks like · Full validity audit
The dual-track architecture
Separating conversation from evaluation eliminates the bias that occurs when a single system asks questions and judges answers.
Track A
Conversationalist
The only AI the candidate sees. Warm, adaptive, peer-level. Gets steering notes from the evaluator but never sees scores. Natural dialogue, not a checklist.
Track B
Evaluator
Runs silently on every message. Evidence, scores, steering notes — structured data only. Behavioural anchors (1–10) keep scoring consistent and explainable.
Track B evaluates before Track A replies — the next response already reflects the latest steering.
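The ordering described here can be sketched as a single turn handler. This is a simplified illustration with stand-in functions rather than AISA's implementation: `evaluate`, `converse`, `handle_turn`, and both data classes are hypothetical, and a real system would call models where the stubs return fixed values.

```python
from dataclasses import dataclass, field


@dataclass
class Evaluation:
    evidence: list[str]       # verbatim quotes pulled from the message
    scores: dict[str, float]  # criterion ID -> 1-10 score
    steering_note: str        # hint for the conversationalist; carries no scores


@dataclass
class Session:
    transcript: list[tuple[str, str]] = field(default_factory=list)
    evaluations: list[Evaluation] = field(default_factory=list)


def evaluate(message: str) -> Evaluation:
    """Stand-in for Track B; a real evaluator would score against the rubric."""
    return Evaluation(evidence=[message], scores={}, steering_note="probe how they verify output")


def converse(transcript: list[tuple[str, str]], steering_note: str) -> str:
    """Stand-in for Track A; it receives the steering note only, never the scores."""
    return f"(reply steered by: {steering_note})"


def handle_turn(session: Session, candidate_message: str) -> str:
    session.transcript.append(("candidate", candidate_message))
    evaluation = evaluate(candidate_message)  # Track B runs first
    session.evaluations.append(evaluation)
    reply = converse(session.transcript, evaluation.steering_note)  # Track A replies second
    session.transcript.append(("assistant", reply))
    return reply
```

Passing only `steering_note` into `converse` keeps the separation explicit: the conversational side has no access to the scores held in `Evaluation`.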
Technical deep-dive: Inside AISA's assessment architecture
What the scores reveal
Predictive validity means scores produce real, differentiating insights — not just a number. Here's what 400+ assessments have surfaced.
400+ completed assessments
average fluency score
real score range
reach Expert tier
How High Scorers Navigate Multi-Step Workflows
What a 7+ looks like in practice vs a 4-5.
Recovering from Bad AI Output
How proficient candidates course-correct vs. mediocre ones.
What 400 Assessments Reveal About Developers
Strong on prompting, weaker than expected on technical understanding.
Safety: The 10% That Reveals the Most
Why the lowest-weighted dimension is the most diagnostic.
Workflow: Separating Talkers from Operators
The highest-weighted dimension and what it measures.
Why Task Decomposition Separates Experts from Novices
W2 scoring applied to real candidate sessions.
Why scores can't be faked
If scores can be gamed, they have no predictive validity. AISA solves this at the architecture level, not with proctoring.
Burst detection
Characters appearing in <50ms windows signal paste, not typing. Human keystrokes are 50–300ms apart.
Style analysis
Vocabulary and formality are compared against the candidate's own baseline. Sudden corporate prose after casual answers = flagged.
AI fingerprinting
Five-metric system detecting AI-generated text: correction rate, edit density, message length, formality, uniformity.
Consistency verification
The same topic probed from multiple angles across the session. Rehearsed frameworks crumble under varied questioning.
Typing metrics weigh 70%, style and AI signals 30%. Flags appear in the report with full transparency — integrity is an architectural property, not a policing function.
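The burst threshold and the 70/30 weighting can be sketched as follows. This is an illustrative reading, not AISA's detector: the function names are hypothetical, both signals are assumed to be normalised to a 0-1 range, and the style and AI signals are treated as one combined value because the split within the 30% is not stated.

```python
BURST_GAP_MS = 50  # from the text: gaps under 50 ms suggest paste, not typing


def burst_fraction(timestamps_ms: list[float]) -> float:
    """Share of gaps between consecutive characters that fall under BURST_GAP_MS."""
    gaps = [later - earlier for earlier, later in zip(timestamps_ms, timestamps_ms[1:])]
    if not gaps:
        return 0.0
    return sum(1 for gap in gaps if gap < BURST_GAP_MS) / len(gaps)


def integrity_risk(typing_signal: float, style_ai_signal: float) -> float:
    """Blend two 0-1 risk signals: typing metrics 70%, style and AI signals 30%."""
    return 0.7 * typing_signal + 0.3 * style_ai_signal
```

For example, timestamps spaced 120 to 140 ms apart give a burst fraction of 0, while characters arriving 1 ms apart give a burst fraction of 1.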
The 10 AI personas
Beyond the score: a profile of how someone interacts with AI, based on the shape of their dimension scores — not just the composite number. Two people can score identically and receive different personas.
Deep technical mastery of AI itself. Understands or builds AI models, works with ML and LLMs at a technical level. Elite critical analysis comes from understanding the technology at its foundation, not just from using it.
Designs and builds sophisticated multi-system AI integrations at scale. Goes beyond creating individual tools to engineering production-grade architectures where AI components interact with each other and non-AI systems.
Personally created complex, useful tools, workflows, or products using AI — whether for their own use, their company, or commercially. Developed deep practical understanding through hands-on building that goes beyond secondhand knowledge.
Uses AI heavily across complex workflows, automations, and multi-tool pipelines. Understands AI limitations well and knows which tool integrates with which. Orchestrates and configures sophisticated setups, but typically works with what's available rather than building novel tools from scratch.
Productive with mainstream AI tools and uses them well within established workflows. Communicates clearly with AI and consistently gets quality output, but typically hasn't pushed into the cutting edge of AI tooling or complex integrations.
Actively building AI skill across multiple dimensions. Tries new tools, refines prompts, and is beginning to develop repeatable patterns — the trajectory is strong.
Approaches AI with critical caution. May under-use AI in practice, but the verification instinct and risk awareness form a strong foundation that many frequent users lack.
Relies on AI for day-to-day output but with limited iteration or verification. Gets value, but leaves quality and safety gains on the table by accepting first-pass results.
Experiments with AI intermittently: a prompt here, a quick question there. Nothing sustained, but a willingness to explore that many skip entirely.
Has heard of AI tools but hasn't meaningfully engaged — the assessment itself may be the most direct interaction to date. Awareness exists; habit does not.
Personas reflect interaction style — usage patterns, habits, and mindset. They correlate with the score but don't directly map to it.
Full profiles: The 10 AI Persona Types
Built on assessment science
AISA is built against the same standards that govern clinical and occupational assessments worldwide. We published a transparent self-audit.
Validity
5/5 · Scores measure what they claim to measure. Conversational evidence, not recall. Published rubric, not a black box.
Reliability
4.5/5 · Consistent results across sessions. Three-layer scoring, confidence weighting, and a calibration pass reduce noise.
Fairness
4.5/5 · Role-adaptive questioning, multilingual support, no demographic proxies in scoring.
Transparency
5/5 · Published rubric, evidence-linked scores, auditable methodology. Every score comes with the quote that produced it.
Standards referenced: AERA/APA/NCME (2014) · ISO 10667 · Schmidt & Hunter (1998) · Messick (1989) · Sackett et al. (2022)
Full self-audit with ratings and evidence: Inside AISA's Assessment Framework
See what the report looks like
Every assessment produces a detailed report with per-criterion scores, evidence quotes, a persona profile, and personalised development guidance.
The Science Behind AISA
In 2026, Anthropic published the AI Fluency Index, the largest empirical study of AI fluency to date, analysing 9,830 conversations. AISA covers 93% of the behaviours Anthropic identified as markers of AI fluency and goes even deeper with 4 additional dimensions. Read our white paper: Anthropic's AI Fluency Study & AISA
AISA's framework is developed by a team with deep roots in tech, behavioural science, and AI product leadership — the rubric is informed by backgrounds spanning the Metropolitan Police, Harvard, Crowdbotics (Silicon Valley), and the European School of Economics.