Quick Reference#
Brief definitions for key terms in the Agent Ecosystem
A#
abstraction#
- label and/or concept that bundles together a set of underlying components or capabilities
- streamlines communication by hiding implementation details
- understanding what an abstraction hides is often necessary for diagnosing unexpected behavior
- “agent” is an abstraction for a collection of distinct parts
A/B test#
- also known as split testing and/or randomized controlled trial
- commonly used by tech companies to test features, interfaces, or algorithms
- experimental method that compares two versions of something to determine which performs better
- randomly assigns participants to treatment or control groups
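A minimal sketch of the comparison step behind an A/B test - a two-proportion z-test using the normal approximation. Function name and the conversion counts are illustrative, not from any particular platform:

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates from two randomly assigned groups."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided tail probability from the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# 12% vs 15% conversion on 1,000 users per arm
z, p = two_proportion_z_test(120, 1000, 150, 1000)
```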
affinity mapping#
- commonly used in UX research and design thinking to synthesize findings
- qualitative research method for organizing and grouping related ideas or observations
- participants sort data points - notes, quotes, themes - into clusters based on natural relationships
agent#
- autonomous system that perceives environment, makes decisions, and takes actions to achieve goals
- typically LLM-based system that can use tools, maintain memory, and execute multi-step tasks
- capabilities include reasoning, planning, tool use, memory management, and interaction
- distinct from chatbots through autonomy and task execution abilities
Agent-as-a-Judge#
- evaluation methodology where an AI agent assesses the performance of other agents
- agent evaluator examines outputs, behaviors, or decision-making processes
- enables scalable evaluation compared to human-only assessment
- related to LLM-as-a-Judge, but focuses on agent-level evaluation rather than just text outputs
agent skill#
- bundle of instructions and reference material that gives an agent just-in-time context for a specific domain or task
- distinct from hooks and slash commands in that skills work through LLM interpretation rather than deterministic execution
- subject to the same context window attention dynamics as other injected content
assistant message#
- output generated by an agent or AI model during a conversational turn
- paired with user messages to form the back-and-forth history the agent uses as context
- related terms: turn, user message
automation#
- use of technology to perform tasks with minimal human intervention
- can range from basic rule-based systems to complex machine learning models
- in AI context - delegation of decision-making or execution to algorithms, robots, or automated agents
B#
benchmark#
- standardized test or dataset used to evaluate and compare system performance
- provides consistent metrics across different models, agents, or approaches
- examples: task completion rates, accuracy scores, reasoning capabilities
- enables objective comparison and tracks progress in the field
C#
canary phrase#
- named after canaries used in coal mines as early warning detectors
- unique marker string embedded in content to verify its presence in a system
- its appearance in output confirms that specific content was loaded and/or processed
codebook#
- structured guide used in qualitative research to categorize and tag data consistently
- defines categories, codes, and rules for applying them to text or observations
- ensures systematic analysis across multiple researchers or datasets
Cohen’s kappa coefficient#
- statistical measure of inter-rater agreement for categorical items
- measures the level of agreement between two raters while accounting for chance agreement
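The chance correction can be sketched directly from the definition - observed agreement minus expected-by-chance agreement, scaled by the maximum possible improvement over chance:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # chance agreement: probability both raters pick the same label independently
    p_e = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)
```

Kappa is 1 for perfect agreement, 0 for agreement no better than chance, and can be negative when raters agree less often than chance would predict.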
conceptual AI experiment#
- test type in which AI exists as a label or framing device, but no AI is actually implemented
- typically uses vignettes or scenarios to model operational principles or consequences of AI
- advantages: high feasibility, easy to scale and replicate, can study impractical or impossible scenarios
- disadvantages: lower naturalness since subjects don’t interact with actual AI
confidence interval#
- range of values likely to contain the true effect size, given the statistical model assumptions
- commonly reported as 95% confidence interval - if computed repeatedly under valid conditions, 95% will contain the true value
- width indicates precision of estimate; narrower intervals mean more precise estimates
- not to be confused with “95% probability the true value is in this range” for any single interval
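A minimal sketch of a 95% interval for a sample mean, using the normal approximation (1.96 standard errors on each side):

```python
from statistics import mean, stdev
from math import sqrt

def ci_95(sample):
    """95% confidence interval for the mean, normal approximation."""
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))  # standard error of the mean
    return m - 1.96 * se, m + 1.96 * se
```

Larger samples shrink the standard error, which is why the interval narrows as more data is collected.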
context window#
- total amount of text, measured in tokens, an LLM can process at once
- includes system prompt, conversation history, and any injected context
- information outside the context window isn’t directly available to the model during a given interaction
context window management#
- agent platform strategies to handle conversations that approach or exceed the context window limit
- determine what gets retained, compressed, or dropped as conversations grow long
- common approaches include summarization and/or selective truncation of earlier messages
- quality of strategy affects whether an agent may “forget” earlier instructions
controlled vs natural#
- experimental design distinction based on environment
- trade-off between control/replicability and external validity/generalizability
- controlled: experiments conducted in artificial settings - labs, online platforms - where researchers manipulate variables
- natural: experiments conducted in real-world settings where AI is actually used - workplaces, platforms, markets
cost-efficiency#
- evaluation metric measuring computational resources required relative to task performance
- factors include token usage, API calls, processing time, energy consumption
- increasingly important as agents scale to production environments
- trade-off: higher accuracy often requires higher costs
D#
dissemination#
- systematic sharing of research findings with target audiences beyond the research team
- ensures knowledge can advance the field, change practice and policy, or inform future research
- requires planning for audience, timing, and appropriate communication channels
- methods include journal publications, conference presentations, social media, press releases, websites
E#
EDD#
- acronym for Evaluation-driven Development
- software development methodology where evaluation guides design and iteration
- incorporates continuous assessment of agent capabilities, reliability, and safety
- testing and metrics inform architectural decisions throughout development lifecycle
- emphasizes measurable outcomes and systematic improvement
edge case#
- critical for testing AI reliability and robustness
- scenario or condition that occurs at extreme operating parameters or unusual circumstances
- falls outside normal operating conditions but within specified boundaries
- examples: unusual inputs, rare combinations of factors, boundary conditions
empirical testing#
- validation approach based on observation and experimentation rather than theory alone
- uses real data and measurable outcomes to evaluate hypotheses
- applies algorithms with actual users, tasks, or environments to measure performance
experimental design#
- systematic planning of how to conduct an experiment to answer a research question
- goal is to isolate causal effects while minimizing confounding factors
- defines variables, treatments, control conditions, randomization, and measurement approach
- includes decisions about sample size, data collection methods, and analysis approach
F#
Final Response Evaluation#
- evaluation methodology that assesses only the end result or output of an agent’s execution
- judges success based on whether final answer or outcome is correct
- advantages: simple to implement, clear success criteria
- limitations: provides no insight into reasoning process, intermediate steps, or failure points
Flesch–Kincaid readability tests#
- designed to indicate how difficult a passage in English is to understand
- score reflects the U.S. grade level needed to comprehend the text
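The published grade-level formula combines average sentence length and average syllables per word (syllable counting itself is nontrivial, so counts are taken as inputs here):

```python
def fk_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level: U.S. grade needed to comprehend the text."""
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)
```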
G#
gate#
- prompt condition that must be satisfied before work sequence continues
- provides objectively evaluable agentic checkpoints: thing happens → condition → then proceed
- different than hooks, which are triggered by events in the harness
- contrasts with rules, which LLMs can interpret, bypass, or rationalize around
Goodhart’s law#
- originally an economics principle, now widely applied to AI and agent systems
- “when a measure becomes a target, it ceases to be a good measure”
- describes phenomenon where optimizing for a proxy metric leads to gaming the metric rather than improving underlying quality
- critical concern: agents may learn to maximize benchmark scores without developing genuine capabilities
- examples: reward hacking, benchmark overfitting, specification gaming
Gunning fog index#
- readability test that estimates the years of formal education needed to understand text on first reading
- score of 12 indicates high school senior level
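The index is computed from average sentence length plus the percentage of complex words - conventionally, words of three or more syllables:

```python
def gunning_fog(total_words, total_sentences, complex_words):
    """Gunning fog index: years of formal education needed for first-read comprehension."""
    return 0.4 * ((total_words / total_sentences)
                  + 100 * (complex_words / total_words))
```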
H#
hallucination#
- critical quality control concern in AI systems
- type of AI output that’s false, fabricated, or unsupported information
- appears plausible, but isn’t grounded in training data or provided context
harness#
- platform layer that wraps around an LLM
- provides configuration, permission settings, system prompts, tools
- may include code search, file operations, shell execution, web access, context management strategy, and temperature settings
- agents using the same underlying model can behave very differently depending on their harness
heuristic#
- practical problem-solving approach that uses shortcuts or rules of thumb to find satisfactory solutions
- differs from algorithms that guarantee optimal solutions
- trades optimality for speed and feasibility when exhaustive search is impractical
- in agent systems, guide decision-making when perfect information or unlimited computation is unavailable
- examples: the heuristic function in A* search, greedy algorithms, hand-crafted evaluation functions
hook#
- script or callback that runs automatically in response to a specific event in the agent’s environment
- fires deterministically based on triggers, such as a file being edited, without going through the LLM’s interpretation loop
- useful for enforcing constraints reliably without relying on the agent to remember to do them
human-in-the-loop#
- system design where humans actively participate in AI decision-making or evaluation process
- human provides feedback, validation, or intervention at critical points
- balances automation with human judgment and oversight
- common in agent evaluation to assess quality, safety, and alignment with human values
L#
LLM#
- abbreviation for Large Language Model
- often informally called “the agent’s brain”
- AI model trained on vast amounts of text data to understand and generate human language
- not all AI is LLM-based - such as computer vision models, recommendation systems
- examples: GPT - Generative Pre-trained Transformer, Claude, and Llama
LLM-as-a-Judge#
- evaluation methodology where a large language model assesses quality of text outputs
- LLM scores or ranks responses based on criteria like accuracy, helpfulness, or safety
- enables scalable evaluation compared to human annotation alone
- limitations include potential biases and consistency issues in LLM judgments
M#
MCP server#
- acronym for Model Context Protocol server
- external server that exposes capabilities to an agent - tools, resources, and/or prompts
- allows agents to interact with databases, APIs, cloud services, or any custom system the server is built to access
- facilitates portable behavior across agent platforms because implementation is stored in the server rather than the harness
memory#
- in agent context - ability to store and retrieve information across interactions and tasks
- enables agents to maintain context, learn from experience, and reference past actions
- critical for multi-step reasoning and adapting behavior based on history
- types include short-term - current task, long-term - across sessions, episodic - specific events
model checking#
- process of evaluating whether statistical model assumptions are satisfied by the data
- includes diagnostic tests for fit, examining residuals, and testing additional model terms
- identifies violations that could invalidate statistical inferences
- itself relies on further assumptions that become part of the full model
N#
natural AI experiment#
- test type that features AI in environments where it is actually used - platforms, workplaces, real services
- often A/B tests run by organizations to improve products or operations
- advantages: highest naturalness, directly applicable findings
- disadvantages: low feasibility, hard to replicate, narrow scope, limited control
non-parametric methods#
- statistical techniques that make fewer assumptions about data distribution than parametric methods
- the name is somewhat misleading - these methods are not assumption-free
- don’t assume data follows specific distribution, such as normal distribution
- still require assumptions such as random sampling or randomization
null hypothesis#
- serves as a baseline for testing - premise proposing zero effect or no relationship between variables
- tested to determine if observed data are unusual enough to reject the hypothesis
- random chance vs true effect - failure to reject doesn’t prove the null is true, only that data are compatible with it
- example: treatment makes no difference in average outcome compared to control
O#
OLS regression#
- abbreviation for Ordinary Least Squares regression
- statistical method that estimates relationships between variables by minimizing squared differences
- finds the best-fitting line through data points
- used in AI testing to build simple prediction models based on historical data
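For a single predictor, the best-fitting line has a closed-form solution - the sketch below fits slope and intercept by minimizing squared residuals:

```python
def ols_fit(xs, ys):
    """Slope and intercept that minimize the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept
```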
one-sided hypothesis#
- also known as dividing hypothesis
- test premise about whether an effect is greater than or less than a specific value
- differs from two-sided tests that check if effect differs in either direction
- example: testing whether new treatment is at least as good as standard treatment
P#
permission and safety systems#
- platform-level rules that define what actions an agent is allowed to take
- conceptual authorization and/or guardrails
- shape agent behavior independently of the underlying model
- examples: requiring confirmation before running shell commands, restricting file access to specific directories, blocking certain categories of action entirely
PII#
- abbreviation for Personally Identifiable Information
- any data that could identify a specific individual
- requires special handling for privacy and security compliance
- examples: Social Security numbers, addresses, dates of birth, biometric data
planning#
- fundamental building block for autonomous task execution
- agent capability to decompose complex goals into sequences of executable actions
- involves reasoning about future states, choosing strategies, and organizing steps
- ranges from basic linear plans to complex multi-step reasoning with contingencies
power#
- probability that a statistical test will reject the test hypothesis when a specific alternative is correct
- calculated before study to determine adequate sample size
- typically designed for 80% power: the test will detect an effect of the assumed size 80% of the time
- doesn’t measure compatibility of alternative hypothesis with observed data
- shouldn’t be used to interpret results after data collection
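A rough sample-size sketch for comparing two group means, using the standard normal-approximation formula - the defaults correspond to a 5% two-sided alpha (z = 1.96) and 80% power (z = 0.84):

```python
from math import ceil

def n_per_group(delta, sigma, z_alpha=1.96, z_beta=0.84):
    """Participants per group to detect a mean difference of delta,
    given outcome standard deviation sigma."""
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)
```

Halving the detectable effect size roughly quadruples the required sample, which is why power calculations belong in the design phase, not after data collection.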
prediction model#
- algorithm or statistical model that forecasts outcomes based on input data
- learns patterns from training data to make predictions about new cases
- ranges from basic regression models to complex neural networks
- accuracy depends on data quality, feature selection, and algorithm sophistication
probability#
- in frequentist statistics: refers to hypothetical frequencies of data patterns under assumed model
- often confused with hypothesis probability, leading to common statistical misinterpretations
- doesn’t refer to probability of hypotheses being true or false
- measured over many repetitions of same procedure under identical conditions
prompt#
- input text or instructions given to an AI model to guide its response
- quality and specificity of prompts significantly affect output quality
- distinct from traditional search queries or commands
- related term: system prompt
proxy test#
- indirect measure used to evaluate something difficult to assess directly
- substitutes an observable indicator for an unmeasurable or impractical characteristic
- trade-off: easier to apply but may occasionally misclassify
- example: using “developed exclusively for research” as a proxy for AI sophistication
P value#
- probability that observed data, or more extreme, would occur if all model assumptions including test hypothesis were correct
- ranges from 0 - complete incompatibility, to 1 - perfect compatibility
- measures fit between data and entire statistical model, not just the hypothesis being tested
- commonly misinterpreted; doesn’t indicate probability that hypothesis is true or false
- often degraded into a “significant” (P ≤ 0.05) vs “insignificant” dichotomy
Q#
qualitative research#
- produces insights about “why” and “how” rather than “how many”
- method focused on understanding meaning, experiences, and context through non-numerical data
- collects data through interviews, observations, open-ended surveys, and document analysis
quasinatural AI experiment#
- test type that combines naturalness of real AI systems with feasibility of lab experiments
- advantages: naturalistic AI, broad research scope, easier data collection than natural experiments
- disadvantages: researchers give up some control over algorithm construction
- examples: testing commercial chatbots in controlled studies, pilot experiments before product launch
R#
RLHF#
- acronym for reinforcement learning from human feedback
- training methodology in which human evaluators rate model outputs and ratings fine-tune the model toward preferred behaviors
- creates a strong instruction-following bias
- models trained with RLHF tend to prioritize explicit user instructions, sometimes at the expense of broader context
- related terms: compliance, sycophancy
robustness#
- system’s ability to maintain performance under varying or adverse conditions
- critical for deployment in real-world, unpredictable environments
- evaluated through stress testing, edge cases, and challenging scenarios
- in agent context - handling unexpected inputs, recovering from errors, adapting to environment changes
rule#
- prompt instruction an LLM interprets and applies at its own discretion
- has implicit opt-out path; model can rationalize skipping
- contrasts with gates, which block progression until a condition is met
- different than hooks, which fire deterministically from harness regardless of LLM interpretation
S#
scalar#
- mathematical concept, specifically from linear algebra
- element of a field which is used to define a vector space through the operation of scalar multiplication
- “scalar value” may refer to a single numerical quantity that has magnitude but no direction
self-reflection#
- agent capability to evaluate its own reasoning, actions, and outputs
- involves identifying errors, assessing performance, and adjusting strategy
- enables learning from mistakes and iterative improvement without external feedback
- distinguishes more sophisticated agents from basic reactive systems
slash command#
- direct command typed into a chat interface, such as /compact or /init
- triggers specific agent behavior without going through the LLM’s interpretation loop
- more predictable and consistent than natural language prompts for actions that need to happen reliably
- contrasts with prompts, which the LLM interprets and may execute differently across runs
spec#
- abbreviation for specification
- implementation guide
- informs everyone building on a format exactly what to expect: which fields exist, what values are valid, how files should be structured, what behavior is required vs optional
statistical inference#
- foundational methodology for evaluating whether observed results are meaningful or due to chance
- process of drawing conclusions about populations or processes from sample data
- includes hypothesis testing, confidence interval estimation, and parameter estimation
- accounts for uncertainty and random variation when making generalizations
statistical model#
- mathematical representation of data variability and all assumptions used to compute statistics
- includes assumptions about - data collection, randomization, treatment allocation, analysis choices
- embodies full web of assumptions beyond just equations with parameters
- violation of any assumption, not just test hypothesis, can produce misleading P values
- often presented in compressed form, with many assumptions unstated or unrecognized
Stepwise Evaluation#
- evaluation methodology that assesses agent performance at each individual step of task execution
- examines correctness of intermediate actions, decisions, and reasoning at granular level
- enables debugging and improvement of specific reasoning or action-taking capabilities
- more resource-intensive than final response evaluation but provides richer diagnostic information
- advantages: identifies exactly where agent succeeds or fails in multi-step processes
stochastic#
- commonly used in mathematics, science, and information theory
- random probability distribution or pattern that may be analyzed statistically, but may not be predicted precisely
stylized AI experiment#
- test type conducted in a controlled environment, since the AI typically doesn’t exist outside the study
- AI tailored to a research question: rule-based algorithms, historical data replication, or reinforcement learning
- advantages: tight control over algorithm features, feasible and replicable, broad scope
- disadvantages: lower naturalness compared to real-world AI systems
sycophancy#
- known limitation of RLHF-trained models, active area of research
- tendency in LLMs to agree with, validate, or comply with user input rather than reasoning independently
- amplified by detailed or specific prompts, which push the model into “execution mode”
synthesis#
- critical step between data collection and decision-making
- process of combining multiple research findings or data points into coherent insights
- transforms raw observations into patterns, themes, and actionable conclusions
system prompt#
- set of instructions provided to the LLM by the platform before any user interaction begins
- sits at the beginning of the context window, giving it strong positional attention weight
- typically not visible to the user, but profoundly shapes the agent’s personality, default behaviors, and constraints
T#
taxonomy#
- classification system that organizes concepts, objects, or phenomena into hierarchical categories
- defines relationships between categories and provides structure to a domain
- helps unify fragmented literature and reveal underexplored questions
- in AI research - frameworks for organizing types of experiments, algorithms, or agent behaviors
temperature#
- parameter that controls the degree of randomness in an LLM’s outputs
- set by the platform and sometimes adjustable by the user
- affects agent behavior independently of the model itself
- low temperature produces more focused, predictable responses
- high temperature produces more varied, creative ones
tool use#
- also known as function calling or API calling
- agent capability to interact with external functions, APIs, or resources to accomplish tasks
- essential for extending agent capabilities beyond pure language generation
- examples: executing code, querying databases, accessing web services, controlling software
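A minimal sketch of the dispatch step in tool use - the model emits a structured call, the harness executes the matching function, and the result is returned as text for the next model turn. The `TOOLS` registry and `handle_tool_call` are illustrative names, not any specific platform’s function-calling API:

```python
import json

# Hypothetical registry mapping tool names to implementations
TOOLS = {
    "get_time": lambda args: "12:00",
    "add": lambda args: str(args["a"] + args["b"]),
}

def handle_tool_call(call_json):
    """Parse the model's structured call, run the tool, return text output."""
    call = json.loads(call_json)
    tool = TOOLS[call["name"]]
    return tool(call.get("arguments", {}))

result = handle_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}')
```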
training data#
- dataset used to teach an AI model patterns, relationships, and knowledge
- model learns by processing examples and adjusting internal parameters
- quality and composition of training data directly affects model capabilities and biases
Trajectory-Based Assessment#
- evaluation methodology that analyzes the complete path or sequence of actions an agent takes
- examines entire decision-making process from initial state to final outcome
- considers not just correctness but efficiency, reasoning quality, and recovery from errors
- provides holistic view of agent behavior including planning, adaptation, and tool use patterns
- enables evaluation of process quality, not just outcome quality
turn#
- single exchange in a conversation: one user message and one assistant message
- agent considers the full turn history when generating a response
- related terms: user message, assistant message
U#
uncertainty quantification#
- process of measuring and characterizing uncertainty in predictions, decisions, or model outputs
- distinguishes between aleatoric uncertainty - inherent randomness, and epistemic uncertainty - lack of knowledge
- enables AI systems to express confidence levels and identify when additional data or validation is needed
- critical for safe deployment in high-stakes domains like healthcare, autonomous systems, and decision support
- common methods - Bayesian inference, ensemble approaches, and Monte Carlo techniques
user message#
- input sent by a human or automated system to an agent during a conversational turn
- interpreted by the LLM rather than executed as a direct command
- receives strong positional attention as the most recent content in the context window
V#
vignette study#
- research method presenting hypothetical scenarios to elicit preferences or judgments
- participants read descriptions of situations and state what they would do
- common in conceptual AI experiments studying ethical dilemmas or preference patterns
- advantages: can model any situation without implementation constraints, easy to scale
- disadvantages: responses may not reflect actual behavior, lower external validity
VOC#
- abbreviation for voice of the client
- invaluable for service and product improvement
- data in which people share problems they’re encountering, provide feedback, and seek further help