Quick Reference#
Brief definitions for key terms in the Agent Ecosystem
A#
abstraction#
- label and/or concept that bundles together a set of underlying components or capabilities
- streamlines communication by hiding implementation details
- understanding what an abstraction hides is often necessary for diagnosing unexpected behavior
- “agent” is an abstraction for a collection of distinct parts
A/B test#
- also known as split testing and/or randomized controlled trial
- commonly used by tech companies to test features, interfaces, or algorithms
- experimental method that compares two versions of something to determine which performs better
- randomly assigns participants to treatment or control groups
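A minimal sketch of the comparison step behind an A/B test - a two-proportion z-test using the normal approximation. Function name and the conversion counts are illustrative, not from any particular platform:

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates from two randomly assigned groups."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided tail probability from the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# 12% vs 15% conversion on 1,000 users per arm
z, p = two_proportion_z_test(120, 1000, 150, 1000)
```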
affinity mapping#
- commonly used in UX research and design thinking to synthesize findings
- qualitative research method for organizing and grouping related ideas or observations
- participants sort data points - notes, quotes, themes - into clusters based on natural relationships
agent#
- autonomous system that perceives environment, makes decisions, and takes actions to achieve goals
- typically LLM-based system that can use tools, maintain memory, and execute multi-step tasks
- capabilities include reasoning, planning, tool use, memory management, and interaction
- distinct from chatbots through autonomy and task execution abilities
Agent-as-a-Judge#
- evaluation methodology where an AI agent assesses the performance of other agents
- agent evaluator examines outputs, behaviors, or decision-making processes
- enables scalable evaluation compared to human-only assessment
- related to LLM-as-a-Judge, but focuses on agent-level evaluation rather than just text outputs
agent skill#
- bundle of instructions and reference material that gives an agent just-in-time context for a specific domain or task
- distinct from hooks and slash commands in that skills work through LLM interpretation rather than deterministic execution
- subject to the same context window attention dynamics as other injected content
assistant message#
- output generated by an agent or AI model during a conversational turn
- paired with user messages to form the back-and-forth history the agent uses as context
- related terms: turn, user message
automation#
- use of technology to perform tasks with minimal human intervention
- can range from basic rule-based systems to complex machine learning models
- in AI context - delegation of decision-making or execution to algorithms, robots, or automated agents
B#
benchmark#
- standardized test or dataset used to evaluate and compare system performance
- provides consistent metrics across different models, agents, or approaches
- examples: task completion rates, accuracy scores, reasoning capabilities
- enables objective comparison and tracks progress in the field
C#
canary phrase#
- named after canaries used in coal mines as early warning detectors
- unique marker string embedded in content to verify its presence in a system
- its appearance in output confirms that specific content was loaded and/or processed
codebook#
- structured guide used in qualitative research to categorize and tag data consistently
- defines categories, codes, and rules for applying them to text or observations
- ensures systematic analysis across multiple researchers or datasets
Cohen’s kappa coefficient#
- statistical measure of inter-rater agreement for categorical items
- measures the level of agreement between two raters while accounting for chance agreement
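The chance correction can be sketched directly from the definition - observed agreement minus expected-by-chance agreement, scaled by the maximum possible improvement over chance:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # chance agreement: probability both raters pick the same label independently
    p_e = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)
```

Kappa is 1 for perfect agreement, 0 for agreement no better than chance, and can be negative when raters agree less often than chance would predict.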
conceptual AI experiment#
- test type in which AI exists as a label or framing device, but no AI is actually implemented
- typically uses vignettes or scenarios to model operational principles or consequences of AI
- advantages: high feasibility, easy to scale and replicate, can study impractical or impossible scenarios
- disadvantages: lower naturalness since subjects don’t interact with actual AI
confidence interval#
- range of values likely to contain the true effect size, given the statistical model assumptions
- commonly reported as 95% confidence interval - if computed repeatedly under valid conditions, 95% will contain the true value
- width indicates precision of estimate; narrower intervals mean more precise estimates
- not to be confused with “95% probability the true value is in this range” for any single interval
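A minimal sketch of a 95% interval for a sample mean, using the normal approximation (1.96 standard errors on each side):

```python
from statistics import mean, stdev
from math import sqrt

def ci_95(sample):
    """95% confidence interval for the mean, normal approximation."""
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))  # standard error of the mean
    return m - 1.96 * se, m + 1.96 * se
```

Larger samples shrink the standard error, which is why the interval narrows as more data is collected.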
context window#
- total amount of text, measured in tokens, an LLM can process at once
- includes system prompt, conversation history, and any injected context
- information outside the context window isn’t directly available to the model during a given interaction
context window management#
- agent platform strategies to handle conversations that approach or exceed the context window limit
- determine what gets retained, compressed, or dropped as conversations grow long
- common approaches include summarization and/or selective truncation of earlier messages
- quality of strategy affects whether an agent may “forget” earlier instructions
controlled vs natural#
- experimental design distinction based on environment
- trade-off between control/replicability and external validity/generalizability
- controlled: experiments conducted in artificial settings - labs, online platforms - where researchers manipulate variables
- natural: experiments conducted in real-world settings where AI is actually used - workplaces, platforms, markets
cost-efficiency#
- evaluation metric measuring computational resources required relative to task performance
- factors include token usage, API calls, processing time, energy consumption
- increasingly important as agents scale to production environments
- trade-off: higher accuracy often requires higher costs
D#
dissemination#
- systematic sharing of research findings with target audiences beyond the research team
- ensures knowledge can advance the field, change practice and policy, or inform future research
- requires planning for audience, timing, and appropriate communication channels
- methods include journal publications, conference presentations, social media, press releases, websites
E#
EDD#
- acronym for Evaluation-driven Development
- software development methodology where evaluation guides design and iteration
- incorporates continuous assessment of agent capabilities, reliability, and safety
- testing and metrics inform architectural decisions throughout development lifecycle
- emphasizes measurable outcomes and systematic improvement
edge case#
- critical for testing AI reliability and robustness
- scenario or condition that occurs at extreme operating parameters or unusual circumstances
- falls outside normal operating conditions but within specified boundaries
- examples: unusual inputs, rare combinations of factors, boundary conditions
empirical testing#
- validation approach based on observation and experimentation rather than theory alone
- uses real data and measurable outcomes to evaluate hypotheses
- applies algorithms with actual users, tasks, or environments to measure performance
experimental design#
- systematic planning of how to conduct an experiment to answer a research question
- goal is to isolate causal effects while minimizing confounding factors
- defines variables, treatments, control conditions, randomization, and measurement approach
- includes decisions about sample size, data collection methods, and analysis approach
F#
Final Response Evaluation#
- evaluation methodology that assesses only the end result or output of an agent’s execution
- judges success based on whether final answer or outcome is correct
- advantages: simple to implement, clear success criteria
- limitations: provides no insight into reasoning process, intermediate steps, or failure points
Flesch–Kincaid readability tests#
- designed to indicate how difficult a passage in English is to understand
- score reflects the U.S. grade level needed to comprehend the text
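The published grade-level formula combines average sentence length and average syllables per word (syllable counting itself is nontrivial, so counts are taken as inputs here):

```python
def fk_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level: U.S. grade needed to comprehend the text."""
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)
```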
G#
gate#
- prompt condition that must be satisfied before work sequence continues
- provides objectively evaluable agentic checkpoints: thing happens → condition → then proceed
- different than hooks, which are triggered by events in the harness
- contrasts with rules, which LLMs can interpret, bypass, or rationalize around
Goodhart’s law#
- originally an economics principle, now widely applied to AI and agent systems
- “when a measure becomes a target, it ceases to be a good measure”
- describes phenomenon where optimizing for a proxy metric leads to gaming the metric rather than improving underlying quality
- critical concern: agents may learn to maximize benchmark scores without developing genuine capabilities
- examples: reward hacking, benchmark overfitting, specification gaming
Gunning fog index#
- readability test that estimates the years of formal education needed to understand text on first reading
- score of 12 indicates high school senior level
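The index is computed from average sentence length plus the percentage of complex words - conventionally, words of three or more syllables:

```python
def gunning_fog(total_words, total_sentences, complex_words):
    """Gunning fog index: years of formal education needed for first-read comprehension."""
    return 0.4 * ((total_words / total_sentences)
                  + 100 * (complex_words / total_words))
```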
H#
hallucination#
- critical quality control concern in AI systems
- type of AI output that’s false, fabricated, or unsupported information
- appears plausible, but isn’t grounded in training data or provided context
harness#
- platform layer that wraps around an LLM
- provides configuration, permission settings, system prompts, tools
- may include code search, file operations, shell execution, web access, context management strategy, and temperature settings
- agents using the same underlying model can behave very differently depending on their harness
heuristic#
- practical problem-solving approach that uses shortcuts or rules of thumb to find satisfactory solutions
- differs from algorithms that guarantee optimal solutions
- trades optimality for speed and feasibility when exhaustive search is impractical
- in agent systems, guide decision-making when perfect information or unlimited computation is unavailable
- examples: the heuristic function in A* search, greedy algorithms, hand-crafted evaluation functions
hook#
- script or callback that runs automatically in response to a specific event in the agent’s environment
- fires deterministically based on triggers, such as a file being edited, without going through the LLM’s interpretation loop
- useful for enforcing constraints reliably without relying on the agent to remember to do them
human-in-the-loop#
- system design where humans actively participate in AI decision-making or evaluation process
- human provides feedback, validation, or intervention at critical points
- balances automation with human judgment and oversight
- common in agent evaluation to assess quality, safety, and alignment with human values
L#
LLM#
- abbreviation for Large Language Model
- often informally called “the agent’s brain”
- AI model trained on vast amounts of text data to understand and generate human language
- not all AI is LLM-based - such as computer vision models, recommendation systems
- examples: GPT - Generative Pre-trained Transformer, Claude, and Llama
LLM-as-a-Judge#
- evaluation methodology where a large language model assesses quality of text outputs
- LLM scores or ranks responses based on criteria like accuracy, helpfulness, or safety
- enables scalable evaluation compared to human annotation alone
- limitations include potential biases and consistency issues in LLM judgments
M#
MCP server#
- acronym for Model Context Protocol server
- external server that exposes capabilities to an agent - tools, resources, and/or prompts
- allows agents to interact with databases, APIs, cloud services, or any custom system the server is built to access
- facilitates portable behavior across agent platforms because implementation is stored in the server rather than the harness
memory#
- in agent context - ability to store and retrieve information across interactions and tasks
- enables agents to maintain context, learn from experience, and reference past actions
- critical for multi-step reasoning and adapting behavior based on history
- types include short-term - current task, long-term - across sessions, episodic - specific events
model checking#
- process of evaluating whether statistical model assumptions are satisfied by the data
- includes diagnostic tests for fit, examining residuals, and testing additional model terms
- identifies violations that could invalidate statistical inferences
- itself relies on further assumptions that become part of the full model
N#
natural AI experiment#
- test type that features AI in environments where it is actually used - platforms, workplaces, real services
- often A/B tests run by organizations to improve products or operations
- advantages: highest naturalness, directly applicable findings
- disadvantages: low feasibility, hard to replicate, narrow scope, limited control
non-parametric methods#
- statistical techniques that make fewer assumptions about data distribution than parametric methods
- the name is somewhat misleading - these methods are not assumption-free
- don’t assume data follows specific distribution, such as normal distribution
- still require assumptions such as random sampling or randomization
null hypothesis#
- serves as a baseline for testing - premise proposing zero effect or no relationship between variables
- tested to determine if observed data are unusual enough to reject the hypothesis
- random chance vs true effect - failure to reject doesn’t prove the null is true, only that data are compatible with it
- example: treatment makes no difference in average outcome compared to control
O#
OLS regression#
- abbreviation for Ordinary Least Squares regression
- statistical method that estimates relationships between variables by minimizing squared differences
- finds the best-fitting line through data points
- used in AI testing to build simple prediction models based on historical data
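For a single predictor, the best-fitting line has a closed-form solution - the sketch below fits slope and intercept by minimizing squared residuals:

```python
def ols_fit(xs, ys):
    """Slope and intercept that minimize the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept
```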
one-sided hypothesis#
- also known as dividing hypothesis
- test premise about whether an effect is greater than or less than a specific value
- differs from two-sided tests that check if effect differs in either direction
- example: testing whether new treatment is at least as good as standard treatment
P#
permission and safety systems#
- platform-level rules that define what actions an agent is allowed to take
- conceptual authorization and/or guardrails
- shape agent behavior independently of the underlying model
- examples: requiring confirmation before running shell commands, restricting file access to specific directories, blocking certain categories of action entirely
PII#
- abbreviation for Personally Identifiable Information
- any data that could identify a specific individual
- requires special handling for privacy and security compliance
- examples: Social Security numbers, addresses, dates of birth, biometric data
planning#
- fundamental building block for autonomous task execution
- agent capability to decompose complex goals into sequences of executable actions
- involves reasoning about future states, choosing strategies, and organizing steps
- ranges from basic linear plans to complex multi-step reasoning with contingencies
power#
- probability that a statistical test will reject the test hypothesis when a specific alternative is correct
- calculated before study to determine adequate sample size
- typically designed for 80% power: the test will detect an effect of the assumed size 80% of the time
- doesn’t measure compatibility of alternative hypothesis with observed data
- shouldn’t be used to interpret results after data collection
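A rough sample-size sketch for comparing two group means, using the standard normal-approximation formula - the defaults correspond to a 5% two-sided alpha (z = 1.96) and 80% power (z = 0.84):

```python
from math import ceil

def n_per_group(delta, sigma, z_alpha=1.96, z_beta=0.84):
    """Participants per group to detect a mean difference of delta,
    given outcome standard deviation sigma."""
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)
```

Halving the detectable effect size roughly quadruples the required sample, which is why power calculations belong in the design phase, not after data collection.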
prediction model#
- algorithm or statistical model that forecasts outcomes based on input data
- learns patterns from training data to make predictions about new cases
- ranges from basic regression models to complex neural networks
- accuracy depends on data quality, feature selection, and algorithm sophistication
probability#
- in frequentist statistics: refers to hypothetical frequencies of data patterns under assumed model
- often confused with hypothesis probability, leading to common statistical misinterpretations
- doesn’t refer to probability of hypotheses being true or false
- measured over many repetitions of same procedure under identical conditions
prompt#
- input text or instructions given to an AI model to guide its response
- quality and specificity of prompts significantly affect output quality
- distinct from traditional search queries or commands
- related term: system prompt
proxy test#
- indirect measure used to evaluate something difficult to assess directly
- substitutes an observable indicator for an unmeasurable or impractical characteristic
- trade-off: easier to apply but may occasionally misclassify
- example: using “developed exclusively for research” as a proxy for AI sophistication
P value#
- probability that observed data, or more extreme, would occur if all model assumptions including test hypothesis were correct
- ranges from 0 - complete incompatibility, to 1 - perfect compatibility
- measures fit between data and entire statistical model, not just the hypothesis being tested
- commonly misinterpreted; doesn’t indicate probability that hypothesis is true or false
- often degraded into a “significant” (P ≤ 0.05) vs “insignificant” dichotomy
Q#
qualitative research#
- produces insights about “why” and “how” rather than “how many”
- method focused on understanding meaning, experiences, and context through non-numerical data
- collects data through interviews, observations, open-ended surveys, and document analysis
quasinatural AI experiment#
- test type that combines naturalness of real AI systems with feasibility of lab experiments
- advantages: naturalistic AI, broad research scope, easier data collection than natural experiments
- disadvantages: researchers give up some control over algorithm construction
- examples: testing commercial chatbots in controlled studies, pilot experiments before product launch
R#
RLHF#
- acronym for reinforcement learning from human feedback
- training methodology in which human evaluators rate model outputs and ratings fine-tune the model toward preferred behaviors
- creates a strong instruction-following bias
- models trained with RLHF tend to prioritize explicit user instructions, sometimes at the expense of broader context
- related terms: compliance, sycophancy
robustness#
- system’s ability to maintain performance under varying or adverse conditions
- critical for deployment in real-world, unpredictable environments
- evaluated through stress testing, edge cases, and challenging scenarios
- in agent context - handling unexpected inputs, recovering from errors, adapting to environment changes
rule#
- prompt instruction an LLM interprets and applies at its own discretion
- has implicit opt-out path; model can rationalize skipping
- contrasts with gates, which block progression until a condition is met
- different than hooks, which fire deterministically from harness regardless of LLM interpretation
S#
scalar#
- mathematical concept, specifically from linear algebra
- element of a field which is used to define a vector space through the operation of scalar multiplication
- “scalar value” may refer to a single numerical quantity that has magnitude but no direction
self-reflection#
- agent capability to evaluate its own reasoning, actions, and outputs
- involves identifying errors, assessing performance, and adjusting strategy
- enables learning from mistakes and iterative improvement without external feedback
- distinguishes more sophisticated agents from basic reactive systems
slash command#
- direct command typed into a chat interface, such as /compact or /init
- triggers specific agent behavior without going through the LLM’s interpretation loop
- more predictable and consistent than natural language prompts for actions that need to happen reliably
- contrasts with prompts, which the LLM interprets and may execute differently across runs
spec#
- abbreviation for specification
- implementation guide
- informs everyone building on a format exactly what to expect: which fields exist, what values are valid, how files should be structured, what behavior is required vs optional
statistical inference#
- foundational methodology for evaluating whether observed results are meaningful or due to chance
- process of drawing conclusions about populations or processes from sample data
- includes hypothesis testing, confidence interval estimation, and parameter estimation
- accounts for uncertainty and random variation when making generalizations
statistical model#
- mathematical representation of data variability and all assumptions used to compute statistics
- includes assumptions about - data collection, randomization, treatment allocation, analysis choices
- embodies full web of assumptions beyond just equations with parameters
- violation of any assumption, not just test hypothesis, can produce misleading P values
- often presented in compressed form, with many assumptions unstated or unrecognized
Stepwise Evaluation#
- evaluation methodology that assesses agent performance at each individual step of task execution
- examines correctness of intermediate actions, decisions, and reasoning at granular level
- enables debugging and improvement of specific reasoning or action-taking capabilities
- more resource-intensive than final response evaluation but provides richer diagnostic information
- advantages: identifies exactly where agent succeeds or fails in multi-step processes
stochastic#
- commonly used in mathematics, science, and information theory
- random probability distribution or pattern that may be analyzed statistically, but may not be predicted precisely
stylized AI experiment#
- test type conducted in a controlled environment, since the AI typically doesn’t exist outside the study
- AI tailored to a research question: rule-based algorithms, historical data replication, or reinforcement learning
- advantages: tight control over algorithm features, feasible and replicable, broad scope
- disadvantages: lower naturalness compared to real-world AI systems
sycophancy#
- known limitation of RLHF-trained models, active area of research
- tendency in LLMs to agree with, validate, or comply with user input rather than reasoning independently
- amplified by detailed or specific prompts, which push the model into “execution mode”
synthesis#
- critical step between data collection and decision-making
- process of combining multiple research findings or data points into coherent insights
- transforms raw observations into patterns, themes, and actionable conclusions
system prompt#
- set of instructions provided to the LLM by the platform before any user interaction begins
- sits at the beginning of the context window, giving it strong positional attention weight
- typically not visible to the user, but profoundly shapes the agent’s personality, default behaviors, and constraints
T#
taxonomy#
- classification system that organizes concepts, objects, or phenomena into hierarchical categories
- defines relationships between categories and provides structure to a domain
- helps unify fragmented literature and reveal underexplored questions
- in AI research - frameworks for organizing types of experiments, algorithms, or agent behaviors
temperature#
- parameter that controls the degree of randomness in an LLM’s outputs
- set by the platform and sometimes adjustable by the user
- affects agent behavior independently of the model itself
- low temperature produces more focused, predictable responses
- high temperature produces more varied, creative ones
tool use#
- also known as function calling or API calling
- agent capability to interact with external functions, APIs, or resources to accomplish tasks
- essential for extending agent capabilities beyond pure language generation
- examples: executing code, querying databases, accessing web services, controlling software
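A minimal sketch of the dispatch step in tool use - the model emits a structured call, the harness executes the matching function, and the result is returned as text for the next model turn. The `TOOLS` registry and `handle_tool_call` are illustrative names, not any specific platform’s function-calling API:

```python
import json

# Hypothetical registry mapping tool names to implementations
TOOLS = {
    "get_time": lambda args: "12:00",
    "add": lambda args: str(args["a"] + args["b"]),
}

def handle_tool_call(call_json):
    """Parse the model's structured call, run the tool, return text output."""
    call = json.loads(call_json)
    tool = TOOLS[call["name"]]
    return tool(call.get("arguments", {}))

result = handle_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}')
```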
training data#
- dataset used to teach an AI model patterns, relationships, and knowledge
- model learns by processing examples and adjusting internal parameters
- quality and composition of training data directly affects model capabilities and biases
Trajectory-Based Assessment#
- evaluation methodology that analyzes the complete path or sequence of actions an agent takes
- examines entire decision-making process from initial state to final outcome
- considers not just correctness but efficiency, reasoning quality, and recovery from errors
- provides holistic view of agent behavior including planning, adaptation, and tool use patterns
- enables evaluation of process quality, not just outcome quality
turn#
- single exchange in a conversation: one user message and one assistant message
- agent considers the full turn history when generating a response
- related terms: user message, assistant message
U#
uncertainty quantification#
- process of measuring and characterizing uncertainty in predictions, decisions, or model outputs
- distinguishes between aleatoric uncertainty - inherent randomness, and epistemic uncertainty - lack of knowledge
- enables AI systems to express confidence levels and identify when additional data or validation is needed
- critical for safe deployment in high-stakes domains like healthcare, autonomous systems, and decision support
- common methods - Bayesian inference, ensemble approaches, and Monte Carlo techniques
user message#
- input sent by a human or automated system to an agent during a conversational turn
- interpreted by the LLM rather than executed as a direct command
- receives strong positional attention as the most recent content in the context window
V#
vignette study#
- research method presenting hypothetical scenarios to elicit preferences or judgments
- participants read descriptions of situations and state what they would do
- common in conceptual AI experiments studying ethical dilemmas or preference patterns
- advantages: can model any situation without implementation constraints, easy to scale
- disadvantages: responses may not reflect actual behavior, lower external validity
VOC#
- abbreviation for voice of the client
- invaluable for service and product improvement
- data in which people share problems they’re encountering, provide feedback, and seek further help