The AISA Rubric: 5 Dimensions of AI Proficiency
A deep dive into AISA's 11-criterion scoring framework across 5 dimensions — what each measures, how scoring works, and why articulating the 'why' matters more than the 'what'.
The single strongest predictor of a high AISA score is not tool knowledge, not prompt complexity, and not technical vocabulary. It is the candidate's ability to articulate why they make the choices they make. Candidates who explain the reasoning behind their approach — why they structured a prompt a certain way, why they chose one tool over another, why they distrusted a particular output — consistently outscore candidates who demonstrate the same behaviors without the accompanying reasoning. The rubric is designed to capture this distinction: two people can take the same action, but the one who can explain the principle behind it demonstrates a fundamentally different level of proficiency.
This article is a complete walkthrough of the AISA scoring framework: 5 dimensions, 11 criteria, a 1–10 scale per criterion, and a weighted composite that produces a 0–100 proficiency score. If you are an engineering manager evaluating candidates, an L&D leader benchmarking your team, or a builder curious about where you stand, this is the reference document.
Why These Five Dimensions
Most AI skill frameworks fall into one of two traps. They either reduce proficiency to a single axis — "Can this person write good prompts?" — or they enumerate dozens of micro-skills that are impossible to assess reliably. We designed AISA's rubric to capture the minimum viable set of dimensions that, together, predict whether someone can do useful, responsible work with AI systems.
The five dimensions emerged from analyzing how effective AI practitioners actually work. Not how they describe their work in a resume, but what they do in live problem-solving. We observed that strong performers share a common loop: they communicate clearly with AI systems, evaluate the outputs critically, understand the underlying technology well enough to diagnose failures, integrate AI into structured workflows, and operate within safety and ethical boundaries. Remove any one of these and the practitioner becomes unreliable in predictable ways.
The weighting reflects real-world impact. Workflow & Application carries the highest weight (25%) because the gap between knowing AI concepts and actually shipping AI-assisted work is where most teams stall. Safety & Responsibility carries the lowest weight (10%) not because it is unimportant, but because it functions as a threshold: candidates who score below 4 on safety exhibit patterns that disqualify them from unsupervised AI work regardless of their other scores.
Each dimension contains two or three criteria. This structure gives assessors enough resolution to identify specific strengths and gaps without creating a scoring matrix so large that inter-rater reliability collapses. Eleven criteria is the sweet spot we landed on after testing frameworks with as few as six and as many as twenty.
Dimension 1: Prompting & Communication (23%)
This dimension measures how effectively a candidate communicates with AI systems. It is not about memorizing prompt templates. It is about whether the candidate treats the AI as a reasoning system that responds to structure, specificity, and context — or as a search bar.
P1: Prompt Design
P1 evaluates the initial construction of prompts. At the Novice level (1–2), candidates type vague, conversational requests and accept whatever comes back. At Competent (5–6), they reliably include role framing, constraints, output format specifications, and examples. At Expert (9–10), they demonstrate principle-level understanding of why certain structures work — they can predict how changes to a prompt will affect output quality, and they adjust their approach based on the model they are using.
The key differentiator at the upper end is not sophistication for its own sake. It is intentionality. An expert-level prompt might be shorter than a competent-level prompt, but every element is there for a reason the candidate can explain.
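As an illustration of the elements P1 rewards — role framing, explicit constraints, an output format specification, and an example — a structured prompt might be assembled like the sketch below. The template and field names are hypothetical, invented for illustration; they are not part of the rubric or any AISA tooling.

```python
# Hypothetical sketch of a structured prompt containing the elements P1
# rewards: role framing, constraints, an output format, and an example.
def build_prompt(role, task, constraints, output_format, example):
    """Assemble a structured prompt; every element is present for a reason."""
    sections = [
        f"You are {role}.",  # role framing
        f"Task: {task}",
        "Constraints:\n" + "\n".join(f"- {c}" for c in constraints),
        f"Output format: {output_format}",  # format specification
        f"Example of the expected output:\n{example}",
    ]
    return "\n\n".join(sections)

prompt = build_prompt(
    role="a senior technical editor",
    task="Summarize the release notes below in plain language.",
    constraints=["Under 120 words", "No marketing language"],
    output_format="Three bullet points",
    example="- Fixed a crash when saving large files",
)
```

Note that an expert might drop any of these sections when it adds nothing — the point is that each one that remains is a deliberate choice.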
P2: Iterative Dialogue
P2 measures the candidate's ability to refine AI outputs through multi-turn conversation. This is where many candidates who score well on P1 plateau. They can write a strong initial prompt but do not know how to diagnose why an output missed the mark, or how to steer the conversation without starting over.
Proficient candidates (7–8) treat the conversation as a feedback loop. They reference specific parts of previous outputs, they escalate or narrow the scope deliberately, and they know when to abandon a thread and restructure. Expert-level performance here resembles a senior engineer pair-programming with a junior developer: they know exactly which follow-up question will get the response they need in one or two turns.
P3: Context & Memory Management
P3 addresses how candidates manage the practical constraints of AI conversations: context windows, conversation history, and information persistence. A Novice does not know these constraints exist. A Competent candidate breaks large tasks into smaller conversations. An Expert designs their workflow around context management — they know when to summarize previous context, when to use system prompts, and how to structure information so that the model retains what matters.
This criterion has grown in importance as AI tools have shifted from single-shot interactions to persistent, session-based workflows; candidates who build with developer-focused AI assistants exercise this skill daily.
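One way to picture the workflow an expert builds around P3 is a history trimmer: keep the system prompt, fold older turns into a summary slot, and retain only the most recent messages within a budget. The helper below is a hypothetical sketch, not AISA's or any vendor's implementation, and it counts words where a real tool would count tokens.

```python
# Hypothetical sketch of context-window management: keep the system prompt
# and the most recent turns within a crude word budget, and replace
# everything older with a one-line summary placeholder.
def trim_history(system_prompt, messages, budget_words=100):
    kept, used = [], 0
    for msg in reversed(messages):  # walk newest turns first
        words = len(msg.split())
        if used + words > budget_words:
            break
        kept.append(msg)
        used += words
    dropped = len(messages) - len(kept)
    summary = [f"[{dropped} earlier messages summarized]"] if dropped else []
    return [system_prompt] + summary + list(reversed(kept))
```

A Competent candidate does this manually by starting fresh conversations; an Expert has some version of it running as a habit or a script.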
Dimension 2: Critical Thinking (22%)
Critical Thinking is the dimension that most sharply separates the top quartile from everyone else. It measures whether a candidate can evaluate AI outputs with the same rigor they would apply to any other information source — and whether they understand where AI systems systematically fail.
T1: Output Evaluation
T1 scores the candidate's ability to assess whether an AI output is correct, complete, and fit for purpose. At the low end, candidates accept outputs at face value or reject them based on gut feeling. At the high end, candidates apply structured evaluation: they check outputs against known facts, they probe edge cases, they test for internal consistency, and they can articulate what "good enough" looks like for a given use case.
The practical test for T1 is whether the candidate catches errors that the AI introduced confidently. Our assessment conversations are designed to surface moments where the AI facilitator provides a plausible but flawed response. How the candidate handles that moment is one of the highest-signal data points in the entire assessment.
T2: Limitation Awareness
T2 measures whether the candidate understands the systematic limitations of current AI systems — hallucination, training data cutoffs, reasoning failures in specific domains, sensitivity to prompt framing, and the difference between pattern matching and genuine understanding.
A Developing candidate (3–4) knows that AI "can be wrong sometimes." A Proficient candidate (7–8) can predict when AI is likely to be wrong for a specific task and adjusts their workflow accordingly. They do not waste time asking an LLM to do reliable arithmetic, and they do not blindly trust AI-generated code without testing it.
This criterion matters enormously for product managers and designers who are making decisions about where to deploy AI in products their users will rely on.
Dimension 3: Technical Understanding (20%)
Technical Understanding does not require a machine learning degree. It requires enough conceptual grounding to make informed decisions about AI tools and to diagnose problems when they arise.
U1: AI Fundamentals
U1 measures whether the candidate understands the core concepts behind the AI systems they use. This includes a working mental model of how large language models generate text, what training data means for output quality, the difference between fine-tuning and prompting, and why temperature and other parameters affect outputs.
We are not looking for candidates to recite the transformer architecture. We are looking for whether their mental model is accurate enough to be useful. A data scientist should score higher here than a product manager, but every role benefits from a mental model that helps them predict system behavior.
U2: Tool Landscape
U2 evaluates the candidate's awareness of the current AI tool ecosystem and their ability to select appropriate tools for specific tasks. This is not a trivia test — we do not care whether someone can name every model on the market. We care whether they understand the tradeoffs: when to use a general-purpose LLM versus a specialized model, when an API integration beats a chat interface, when a RAG pipeline is the right architecture versus fine-tuning.
The scoring here rewards candidates who have hands-on experience with multiple tools and can explain why they chose one over another. A candidate who has only used one AI tool but understands its strengths and limitations deeply will outscore a candidate who name-drops ten tools but cannot articulate a selection rationale.
Dimension 4: Workflow & Application (25%)
This is the highest-weighted dimension because it measures what ultimately matters: can this person get real work done with AI? Knowing how AI works and being able to use AI to ship outcomes are different skills, and the gap between them is larger than most people assume.
W1: Workflow Integration
W1 measures how the candidate incorporates AI tools into their existing work processes. At the low end, AI use is ad hoc — the candidate opens ChatGPT when they are stuck and closes it when they get an answer. At the high end, AI is woven into a structured workflow with defined handoff points, quality gates, and human review stages.
Proficient candidates can describe their AI workflow as a repeatable process. Expert candidates have optimized that process — they know which steps benefit from AI involvement, which steps require human judgment, and they have built personal systems (templates, checklists, automated pipelines) that encode these decisions.
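A personal system of the kind W1 describes can be encoded, in a deliberately toy form, as an ordered pipeline where each step declares who performs it and whether a human review gate must pass before the next step runs. The step names and structure below are invented for illustration; they are not AISA's model of any particular workflow.

```python
# Toy encoding of an AI-assisted workflow: each step records who performs
# it and whether a human review gate must pass before the pipeline continues.
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    performer: str          # "ai" or "human"
    review_gate: bool = False

# Hypothetical code-review workflow with explicit handoff points.
workflow = [
    Step("draft summary of the diff", performer="ai"),
    Step("flag risky patterns", performer="ai", review_gate=True),
    Step("final approval", performer="human"),
]

def run(workflow, execute, review):
    """Execute steps in order; a failed review gate halts the pipeline."""
    for step in workflow:
        execute(step)
        if step.review_gate and not review(step):
            return False    # human rejected the AI output; stop here
    return True
```

The value of writing the workflow down is exactly what W1 measures: the handoff points and quality gates stop being ad hoc and become inspectable decisions.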
W2: Task Decomposition
W2 evaluates the candidate's ability to break complex problems into AI-appropriate subtasks. This is a core engineering skill applied to a new context. The question is not "Can you use AI to do this task?" but "Can you figure out which parts of this task AI can handle, which parts it cannot, and how to reassemble the pieces?"
Strong task decomposition is especially critical for developers working on AI-assisted codebases. The difference between a developer who asks AI to "build me a login system" and one who decomposes the problem into authentication flow design, token management, UI components, and test cases — and knows which subtasks to delegate to AI — is the difference between unreliable output and production-ready code.
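The login-system decomposition above can be written down as data, making the delegation decision explicit per subtask. This is a hypothetical sketch of the habit, not a notation AISA prescribes.

```python
# Hypothetical decomposition of "build a login system" into subtasks,
# with an explicit decision about which parts to delegate to AI.
subtasks = {
    "authentication flow design": "human",      # architecture needs human judgment
    "token management":           "ai+review",  # AI draft, human security review
    "UI components":              "ai",
    "test cases":                 "ai",
}

# Everything delegated to AI, with or without a review step.
delegated = [task for task, owner in subtasks.items() if owner.startswith("ai")]
```

The table does not need to be code — a checklist works — but a candidate who can produce it at all is demonstrating W2.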
W3: Domain Application
W3 scores the candidate's ability to apply AI tools effectively within their specific professional domain. A developer using AI for code review is exercising different domain knowledge than a designer using AI for user research synthesis. W3 measures whether the candidate adapts their AI usage to the norms, constraints, and quality standards of their field.
This is where generalist AI knowledge meets specialist expertise. The highest scores go to candidates who have developed domain-specific AI workflows — not just using AI generically, but leveraging it in ways that reflect deep understanding of their field's requirements.
How Scoring Works
Each of the 11 criteria is scored on a 1–10 scale. The scores are not arbitrary — they map to five proficiency bands that describe observable behaviors:
- 1–2 (Novice): The candidate is unaware that this is a distinct skill. They may use AI but show no intentionality.
- 3–4 (Developing): The candidate is aware of the skill and attempts to apply it, but inconsistently.
- 5–6 (Competent): The candidate demonstrates functional, repeatable techniques. They can do the thing reliably.
- 7–8 (Proficient): The candidate is intentional and can explain why their approach works. They adapt to novel situations.
- 9–10 (Expert): The candidate operates at a principle level. They have internalized the skill so deeply that it has reshaped how they think about their work.
The composite score is calculated by weighting each dimension according to its assigned percentage (Prompting 23%, Critical Thinking 22%, Technical Understanding 20%, Workflow 25%, Safety 10%) and normalizing to a 0–100 scale.
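Putting the band definitions and the published weights together, the arithmetic can be sketched as below. The article does not specify the exact normalization, so the linear mapping from the 1–10 weighted average onto 0–100 is an assumption, as is treating each dimension score as the mean of its criterion scores.

```python
# Sketch of the composite calculation under stated assumptions: dimensions
# are combined with the published weights, and the resulting 1-10 value is
# mapped linearly onto 0-100. The normalization choice is an assumption.
WEIGHTS = {
    "prompting": 0.23,
    "critical_thinking": 0.22,
    "technical": 0.20,
    "workflow": 0.25,
    "safety": 0.10,
}

BANDS = [(2, "Novice"), (4, "Developing"), (6, "Competent"),
         (8, "Proficient"), (10, "Expert")]

def band(score):
    """Map a 1-10 criterion score to its proficiency band."""
    for upper, name in BANDS:
        if score <= upper:
            return name
    raise ValueError("score must be in 1-10")

def composite(dimension_scores):
    """Weighted 1-10 average across dimensions, rescaled to 0-100."""
    weighted = sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
    return round((weighted - 1) / 9 * 100, 1)
```

Under this reading, a candidate scoring 10 on every dimension lands at 100 and a candidate scoring 1 everywhere lands at 0, matching the stated 0–100 range.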
Critically, scores are based on evidence observed during the conversation, not self-reported claims. AISA's dual-track architecture means that the scoring AI independently evaluates every response — it does not take the candidate's word for anything. If a candidate claims to be an expert prompt engineer but writes vague, unstructured prompts during the assessment, the score reflects the observed behavior.
Score Interpretation: The Five Bands in Practice
Understanding what each band looks like in practice helps managers and L&D leaders act on AISA results.
A team with an average composite score of 35 (Developing) is a team where most people have tried AI tools but have not built reliable habits around them. Training should focus on foundational skills: structured prompting, output verification, and basic tool selection. This is where most teams land today, and the gap is largest in Critical Thinking — people use AI but do not systematically evaluate what it gives them.
A team averaging 55 (Competent) has functional AI skills. They can get work done with AI tools. The upskilling opportunity here is moving from "I know how to use this" to "I know why this works and when it will fail." Targeted training on limitation awareness and workflow optimization yields the highest ROI at this level.
A team averaging 75 (Proficient) is operating at a level where AI is genuinely integrated into their work. These teams benefit more from peer learning, advanced tool exploration, and domain-specific AI application workshops than from foundational training.
For a deeper analysis of how scores distribute across roles and dimensions, see our 2026 AI Skills Report. For guidance on turning score data into upskilling plans, read The AI Skills Gap: How to Benchmark and Upskill Your Existing Team.
Why This Rubric Works
The AISA rubric is designed around three principles that distinguish it from simpler AI skill assessments.
First, it measures behavior, not knowledge. A candidate who knows the definition of "few-shot prompting" but never uses it effectively scores lower than a candidate who structures examples into their prompts intuitively. The conversational format makes this distinction possible — you cannot fake behavior over a 25-minute adaptive conversation the way you can select the right answer on a multiple-choice test.
Second, it separates dimensions that are often conflated. Many teams assume that someone who writes good prompts also evaluates outputs carefully. In practice, these skills are independent — a candidate can be a Proficient prompt engineer and a Developing critical thinker. The rubric's multi-dimensional structure surfaces these asymmetries so that hiring and training decisions can be precise.
Third, it scales across roles. The same rubric applies to developers, product managers, designers, and data scientists, but the conversation adapts to each role's context. A developer's W3 score reflects AI application in software engineering; a designer's W3 score reflects AI application in design workflows. The criteria are universal; the evidence is domain-specific.
The rubric is not a static document. As AI tools evolve and the baseline of "what every professional should know" shifts, the criteria descriptions and band definitions will be updated. But the five dimensions — communication, critical thinking, technical understanding, workflow integration, and safety — represent durable categories of AI proficiency that will remain relevant regardless of which specific tools dominate the market.
Ready to try the AI skills assessment yourself?