Role Assessment

LLM & RAG Assessment for Data Scientists

Assessing whether they can build AI systems, not just use chatbots

The gap between data scientists who use LLMs and those who understand them is wider than in any other role. A data scientist scoring above 8 on AISA can explain why their RAG pipeline chunks documents at 512 tokens with 50-token overlap, which embedding model they chose and why, and how they evaluate retrieval quality beyond cosine similarity. Below 5, they typically describe using ChatGPT for exploratory analysis. AISA makes this spectrum visible.

What We Assess

Specific AI competencies we probe through natural conversation, tailored for data scientists.

RAG Architecture & Implementation

Can they design a retrieval-augmented generation pipeline from scratch? We probe chunking strategies, embedding model selection, vector database trade-offs, re-ranking approaches, and how they handle documents that do not fit neatly into chunks. The difference between a textbook answer and production experience shows clearly in how they discuss edge cases.
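
As a concrete example of one such decision, here is a minimal sliding-window chunker in Python. It is a sketch, not a recommendation: whitespace tokens stand in for a real tokenizer, and the 512/50 defaults are illustrative.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into overlapping chunks.

    Each chunk starts chunk_size - overlap tokens after the
    previous one, so adjacent chunks share `overlap` tokens of
    context across the boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    stop = max(len(tokens) - overlap, 1)
    return [tokens[i:i + chunk_size] for i in range(0, stop, step)]

# Toy usage with whitespace "tokens"; a real pipeline would use the
# embedding model's own tokenizer so chunk sizes match its limits.
chunks = chunk_tokens("some long document ...".split())
```

The boundary handling (what happens to the final partial chunk, or to documents shorter than one chunk) is exactly the kind of edge case where production experience shows.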

Fine-Tuning Decision Framework

When should you fine-tune versus prompt-engineer versus RAG? We test whether data scientists have a decision framework for this critical choice. Fine-tuning is expensive and often unnecessary; we assess whether they default to it from habit or choose it deliberately based on data volume, task specificity, and latency requirements.
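
A hedged sketch of what such a framework might look like in code. The thresholds below are purely illustrative assumptions, not AISA's rubric:

```python
def choose_approach(labeled_examples: int,
                    needs_private_knowledge: bool,
                    needs_narrow_style_or_format: bool,
                    latency_budget_ms: int) -> str:
    """Toy heuristic for prompt engineering vs. RAG vs. fine-tuning.

    Thresholds are illustrative; a real decision also weighs cost,
    maintenance burden, and how often the underlying data changes.
    """
    # Facts the base model doesn't know are a retrieval problem,
    # not a weights problem: fine-tuning bakes in stale knowledge.
    if needs_private_knowledge:
        return "RAG (plus prompt engineering)"
    # With few labeled examples, fine-tuning tends to underperform
    # careful prompting with well-chosen few-shot examples.
    if labeled_examples < 1000:
        return "prompt engineering"
    # Fine-tuning pays off for narrow tasks where a smaller tuned
    # model can meet a tight latency budget.
    if needs_narrow_style_or_format and latency_budget_ms < 500:
        return "fine-tuning (consider LoRA first)"
    return "prompt engineering; escalate only if evals plateau"
```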

Evaluation & Metrics Design

How do you measure whether an LLM system is performing well when there is no single correct answer? We look for familiarity with LLM-as-judge approaches, human evaluation protocols, retrieval precision and recall, factual grounding metrics, and the practical challenge of building evaluation sets that actually represent production traffic.
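
The retrieval side of that evaluation is the easiest to pin down: precision and recall at k reduce to a few lines once relevant documents are labeled per query. A minimal sketch:

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int):
    """Precision@k and recall@k for one query.

    `retrieved` is the ranked list of doc IDs from the retriever;
    `relevant` is the labeled set of doc IDs that answer the query.
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Retriever returned d3, d1, d9; labels say d1 and d7 are relevant:
p, r = precision_recall_at_k(["d3", "d1", "d9"], {"d1", "d7"}, k=3)
# p == 1/3, r == 1/2
```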

Production System Awareness

Can they think beyond the notebook? We assess understanding of inference cost, latency budgets, model serving architectures, caching strategies, monitoring for model drift, and the operational reality of keeping an LLM system running in production. Data scientists who score well here have deployed systems, not just trained them.
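
As one small illustration of that operational mindset, here is a sketch of exact-match response caching with graceful fallback. `call_llm` is a hypothetical stand-in for a model client; a production system would add TTLs, a semantic (embedding-based) cache, and monitoring.

```python
import hashlib

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your model client."""
    raise NotImplementedError

FALLBACK = "Sorry, I can't answer that right now."

_cache: dict[str, str] = {}

def cached_completion(prompt: str) -> str:
    """Exact-match cache with a static fallback.

    Normalizing the prompt before hashing raises the hit rate on
    common queries and keeps cache keys a fixed size.
    """
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key in _cache:
        return _cache[key]
    try:
        result = call_llm(prompt)
    except Exception:
        # Model unavailable or timing out: degrade gracefully
        # rather than surfacing a stack trace to the user.
        return FALLBACK
    _cache[key] = result
    return result
```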

Dimension Focus

AISA scores across five dimensions. Here is how they are weighted for data scientists.

Technical Understanding (20%)

LLM architecture, embedding models, RAG pipelines, fine-tuning methods, tokenization, and context window management. We assess whether they have a working mental model of these systems — not textbook definitions, but practical understanding that informs design decisions. Can they explain why you would choose LoRA over full fine-tuning for a specific use case? Do they understand the trade-offs between dense and sparse retrieval?
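
The LoRA question has a concrete arithmetic core: a rank-r adapter on a d_out × d_in weight matrix trains r(d_in + d_out) parameters instead of d_in · d_out. A quick worked example, with illustrative dimensions:

```python
def lora_param_ratio(d_in: int, d_out: int, rank: int) -> float:
    """Fraction of a weight matrix's parameters a LoRA adapter trains.

    LoRA replaces the update to W (d_out x d_in) with B @ A, where
    A is (rank x d_in) and B is (d_out x rank), so the adapter has
    rank * (d_in + d_out) trainable parameters.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return lora / full

# A 4096x4096 projection with a rank-8 adapter: 8 * (4096 + 4096)
# = 65,536 trainable params vs. ~16.8M full -- about 0.4%.
print(f"{lora_param_ratio(4096, 4096, 8):.4f}")  # 0.0039
```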

Critical Thinking (22%)

Evaluation methodology is where data scientists differentiate themselves. We test whether they can design evaluation frameworks for generative outputs (where ground truth is fuzzy), identify when benchmark numbers are misleading, analyze failure modes systematically, and recognize when an LLM solution is worse than a simpler baseline. The best data scientists maintain healthy skepticism about model performance claims.
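
One simple instrument for that kind of scrutiny is a pairwise comparison against the simpler baseline, with randomized ordering to control for the position bias LLM judges are known to exhibit. A sketch, where `judge` is a hypothetical callable (an LLM-as-judge prompt or a human rater):

```python
import random

def win_rate(system_outputs, baseline_outputs, judge):
    """Fraction of pairs where the LLM system beats the baseline.

    judge(a, b) returns True if output `a` is better. Which side
    each system appears on is randomized because LLM judges tend
    to favor whichever answer they see first.
    """
    wins = 0
    pairs = list(zip(system_outputs, baseline_outputs))
    for sys_out, base_out in pairs:
        if random.random() < 0.5:
            wins += judge(sys_out, base_out)
        else:
            wins += not judge(base_out, sys_out)
    return wins / len(pairs)

# A win rate near 0.5 means the LLM system is not clearly better
# than the simple baseline -- a finding worth reporting honestly.
```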

Prompting & Communication (23%)

System prompt design, few-shot example selection, chain-of-thought structuring, and output formatting. For data scientists, prompting is an engineering discipline, not a creative exercise. We assess whether they can construct prompt pipelines that produce consistent, parseable outputs — and whether they understand why prompt engineering is often a better first step than fine-tuning.
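
To make "consistent, parseable outputs" concrete, here is a minimal sketch of a prompt pipeline that demands JSON and validates it with a bounded retry. `call_llm` is a hypothetical model client and the schema is illustrative:

```python
import json

def call_llm(system: str, user: str) -> str:
    """Hypothetical stand-in for a model client."""
    raise NotImplementedError

SYSTEM_PROMPT = (
    "You are a sentiment classifier. Respond with JSON only, in the form "
    '{"label": "positive" | "negative" | "neutral", "confidence": 0.0-1.0}.'
)

def classify(text: str, max_retries: int = 2) -> dict:
    """Call the model and parse strict JSON, retrying on bad output.

    Validation plus a bounded retry loop is what turns a stochastic
    generator into something a downstream pipeline can depend on.
    """
    for _ in range(max_retries + 1):
        raw = call_llm(SYSTEM_PROMPT, text)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: retry
        if parsed.get("label") in {"positive", "negative", "neutral"}:
            return parsed
    raise ValueError("no parseable output after retries")
```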

The remaining two dimensions, Workflow and Safety, carry the other 35% of the weighting for this role.

What Good vs. Poor Looks Like

Patterns from data scientist assessments. The technical depth gap in this role is the most pronounced we see.

Strong signal (score 7-10)

  • Can walk through a RAG pipeline they built and explain each design decision: why that embedding model, why that chunk size, what re-ranking approach, how they measured retrieval quality, and what failure modes they encountered in production.
  • Has a clear decision framework for prompt engineering vs. fine-tuning vs. RAG. Can articulate the cost, data, and latency trade-offs for each approach on a specific problem. Has chosen not to fine-tune when it was the obvious but wrong answer.
  • Discusses evaluation methodology with nuance: knows that BLEU and ROUGE are insufficient for LLM evaluation, has experience with LLM-as-judge frameworks, and understands the bootstrapping problem of building evaluation sets for novel tasks.
  • Thinks about production concerns unprompted: inference cost per query, p99 latency, caching strategies for common queries, monitoring for embedding drift, and fallback behavior when the model is unavailable or returning garbage.

Weak signal (score 1-4)

  • Describes RAG as "connecting a model to documents" without understanding the pipeline components. Cannot explain chunking strategies, embedding model selection, or how retrieval quality affects generation quality.
  • Defaults to fine-tuning for every problem. Has not considered that prompt engineering or RAG might solve the problem at 1/10th the cost. Treats fine-tuning as the "real" ML approach and prompting as a hack.
  • Has no evaluation strategy for generative outputs. "We had users rate the quality" is the extent of their methodology. No mention of automated metrics, evaluation set design, or systematic error analysis.
  • All experience is in notebooks and prototypes. No awareness of inference cost, latency constraints, or the operational complexity of keeping an LLM system running. Cannot discuss what happens after the model is "done."

The Conversation Approach

AISA does not ask data scientists to write code or solve math problems. Instead, it conducts a systems design conversation — the kind you would have when evaluating a candidate's ability to architect and evaluate AI systems.

A typical data scientist assessment starts with their recent LLM-related work, digs into the architecture decisions they made, explores how they evaluated the system, and then presents scenarios that test their ability to reason about unfamiliar problems. The conversation might ask them to design a RAG system for a domain they have not worked in, or to diagnose a described failure mode in an LLM pipeline. The goal is not right answers — it is rigorous thinking.

This matters because traditional data science interviews (take-homes, whiteboard statistics, SQL tests) do not test LLM-specific competency at all. A candidate might excel at classical ML and struggle to design a production RAG system. The conversational format lets AISA probe the depth of understanding that distinguishes someone who has read about LLMs from someone who has built with them. For more on AISA's scoring methodology, see the AISA Rubric documentation.

The Hiring Context

The data science talent market has shifted dramatically. Two years ago, "ML experience" meant classical supervised learning — regression, classification, gradient boosting. Today, most data science job postings include LLM, RAG, or generative AI as requirements. The supply of candidates who actually have production LLM experience is far smaller than the number who claim it.

Making this worse, LLM skills are hard to verify through traditional interviews. You cannot whiteboard a RAG pipeline. Take-home assignments test coding ability, not system design judgment. And resumes that list "LangChain" and "vector databases" tell you nothing about whether the candidate understands why they are using those tools or is just following a tutorial.

AISA provides a structured 25-minute assessment that specifically targets LLM and generative AI competency. The report distinguishes between surface-level familiarity and genuine system-building capability, with quoted evidence for every score. For teams building AI platforms and products, this signal is the one that traditional pipelines miss entirely. For the full context on how AI skills gaps affect hiring, read our 2026 AI Skills Report.

Why It Matters

Data science hiring has traditionally focused on statistical fundamentals, ML algorithms, and coding proficiency. These still matter, but the field has shifted. Teams now need data scientists who can build and evaluate LLM-powered systems — and the skill set for that is distinct from classical ML. A strong traditional data scientist might struggle with prompt engineering, RAG architecture, or evaluating generative outputs. AISA provides evidence of LLM-specific competency that traditional interviews miss, giving hiring managers a clear signal on whether a candidate can contribute to the AI systems your team is actually building.

Further Reading

Understand the methodology and context behind our AI skills assessment.

Start assessing data scientists

One conversation. Evidence-based scoring across five dimensions. A report you can actually use to make hiring decisions.