Quick Reference#

Brief definitions for key terms in the Agent Ecosystem


A#


abstraction#

  • label and/or concept that bundles together a set of underlying components or capabilities
  • streamlines communication by hiding implementation details
  • understanding what an abstraction hides is often necessary for diagnosing unexpected behavior
  • “agent” is an abstraction for a collection of distinct parts

A/B test#

  • also known as split testing and/or randomized controlled trial
  • commonly used by tech companies to test features, interfaces, or algorithms
  • experimental method that compares two versions of something to determine which performs better
  • randomly assigns participants to treatment or control groups
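
A minimal sketch of the assignment-and-compare pattern in Python (the participant IDs, outcomes, and `seed` are illustrative):

```python
import random
import statistics

def ab_assign(participants, seed=42):
    # Randomly assign each participant to control ("A") or treatment ("B")
    rng = random.Random(seed)
    return {p: rng.choice(["A", "B"]) for p in participants}

def difference_in_means(outcomes, assignment):
    # Estimate the treatment effect as the difference in group averages
    a = [outcomes[p] for p, g in assignment.items() if g == "A"]
    b = [outcomes[p] for p, g in assignment.items() if g == "B"]
    return statistics.mean(b) - statistics.mean(a)
```

Real A/B tests add a significance test on top of the raw difference before drawing conclusions.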

affinity mapping#

  • commonly used in UX research and design thinking to synthesize findings
  • qualitative research method for organizing and grouping related ideas or observations
  • participants sort data points - notes, quotes, themes - into clusters based on natural relationships

agent#

  • autonomous system that perceives environment, makes decisions, and takes actions to achieve goals
  • typically LLM-based system that can use tools, maintain memory, and execute multi-step tasks
  • capabilities include reasoning, planning, tool use, memory management, and interaction
  • distinct from chatbots through autonomy and task execution abilities

Agent-as-a-Judge#

  • evaluation methodology where an AI agent assesses the performance of other agents
  • agent evaluator examines outputs, behaviors, or decision-making processes
  • enables scalable evaluation compared to human-only assessment
  • related to LLM-as-a-Judge, but focuses on agent-level evaluation rather than just text outputs

agent skill#

  • bundle of instructions and reference material that gives an agent just-in-time context for a specific domain or task
  • distinct from hooks and slash commands in that skills work through LLM interpretation rather than deterministic execution
  • subject to the same context window attention dynamics as other injected content

assistant message#

  • output generated by an agent or AI model during a conversational turn
  • paired with user messages to form the back-and-forth history the agent uses as context
  • related terms: turn, user message

automation#

  • use of technology to perform tasks with minimal human intervention
  • can range from basic rule-based systems to complex machine learning models
  • in AI context - delegation of decision-making or execution to algorithms, robots, or automated agents

B#


benchmark#

  • standardized test or dataset used to evaluate and compare system performance
  • provides consistent metrics across different models, agents, or approaches
  • task completion rates, accuracy scores, reasoning capabilities
  • enables objective comparison and tracks progress in the field

C#


canary phrase#

  • named after canaries used in coal mines as early warning detectors
  • unique marker string embedded in content to verify its presence in a system
  • its appearance in output confirms that specific content was loaded and/or processed
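
The idea can be sketched in a few lines (the marker string is arbitrary and would normally be generated per document):

```python
CANARY = "CANARY-7f3a9b"  # arbitrary marker unlikely to occur by chance

def canary_present(output: str) -> bool:
    # Seeing the marker in the output confirms the tagged content was processed
    return CANARY in output
```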

codebook#

  • structured guide used in qualitative research to categorize and tag data consistently
  • defines categories, codes, and rules for applying them to text or observations
  • ensures systematic analysis across multiple researchers or datasets

Cohen’s kappa coefficient#

  • statistical measure of inter-rater agreement for categorical items
  • measures the level of agreement between two raters while accounting for chance agreement
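
For two raters labeling the same items, kappa can be computed directly (a minimal sketch; the labels are illustrative):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    # Observed agreement: fraction of items both raters labeled identically
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance.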

conceptual AI experiment#

  • test type in which AI exists as a label or framing device, but no AI is actually implemented
  • typically uses vignettes or scenarios to model operational principles or consequences of AI
  • advantages: high feasibility, easy to scale and replicate, can study impractical or impossible scenarios
  • disadvantages: lower naturalness since subjects don’t interact with actual AI

confidence interval#

  • range of values likely to contain the true effect size, given the statistical model assumptions
  • commonly reported as 95% confidence interval - if the procedure were repeated under valid conditions, 95% of the computed intervals would contain the true value
  • width indicates precision of estimate; narrower intervals mean more precise estimates
  • not to be confused with “95% probability the true value is in this range” for any single interval
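
A normal-approximation sketch in plain Python, assuming a reasonably large sample:

```python
from statistics import NormalDist, mean, stdev

def confidence_interval(sample, level=0.95):
    # Normal-approximation CI for the mean: estimate ± z * standard error
    z = NormalDist().inv_cdf(0.5 + level / 2)
    m = mean(sample)
    se = stdev(sample) / len(sample) ** 0.5
    return m - z * se, m + z * se
```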

context window#

  • total amount of text, measured in tokens, an LLM can process at once
  • includes system prompt, conversation history, and any injected context
  • information outside the context window isn’t directly available to the model during a given interaction

context window management#

  • agent platform strategies to handle conversations that approach or exceed the context window limit
  • determine what gets retained, compressed, or dropped as conversations grow long
  • common approaches include summarization and/or selective truncation of earlier messages
  • quality of strategy affects whether an agent may “forget” earlier instructions
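
One selective-truncation strategy can be sketched as follows (the whitespace-based token counter is a crude stand-in for a real tokenizer):

```python
def truncate_history(messages, budget, count_tokens=lambda m: len(m["content"].split())):
    # Keep the system prompt, then as many of the most recent messages as fit
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(count_tokens(m) for m in system)
    kept = []
    for m in reversed(rest):
        cost = count_tokens(m)
        if used + cost > budget:
            break  # oldest messages beyond the budget are dropped
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```

Production strategies often summarize the dropped messages instead of discarding them outright.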

controlled vs natural#

  • experimental design distinction based on environment
  • trade-off between control/replicability and external validity/generalizability
  • controlled: experiments conducted in artificial settings - labs, online platforms - where researchers manipulate variables
  • natural: experiments conducted in real-world settings where AI is actually used - workplaces, platforms, markets

cost-efficiency#

  • evaluation metric measuring computational resources required relative to task performance
  • factors include token usage, API calls, processing time, energy consumption
  • increasingly important as agents scale to production environments
  • trade-off: higher accuracy often requires higher costs

D#


dissemination#

  • systematic sharing of research findings with target audiences beyond the research team
  • ensures knowledge can advance the field, change practice and policy, or inform future research
  • requires planning for audience, timing, and appropriate communication channels
  • methods include journal publications, conference presentations, social media, press releases, websites

E#


EDD#

  • acronym for Evaluation-driven Development
  • software development methodology where evaluation guides design and iteration
  • incorporates continuous assessment of agent capabilities, reliability, and safety
  • testing and metrics inform architectural decisions throughout development lifecycle
  • emphasizes measurable outcomes and systematic improvement

edge case#

  • critical for testing AI reliability and robustness
  • scenario or condition that occurs at extreme operating parameters or unusual circumstances
  • falls outside normal operating conditions but within specified boundaries
  • examples: unusual inputs, rare combinations of factors, boundary conditions

empirical testing#

  • validation approach based on observation and experimentation rather than theory alone
  • uses real data and measurable outcomes to evaluate hypotheses
  • applies algorithms with actual users, tasks, or environments to measure performance

experimental design#

  • systematic planning of how to conduct an experiment to answer a research question
  • goal is to isolate causal effects while minimizing confounding factors
  • defines variables, treatments, control conditions, randomization, and measurement approach
  • includes decisions about sample size, data collection methods, and analysis approach

F#


Final Response Evaluation#

  • evaluation methodology that assesses only the end result or output of an agent’s execution
  • judges success based on whether final answer or outcome is correct
  • advantages: simple to implement, clear success criteria
  • limitations: provides no insight into reasoning process, intermediate steps, or failure points

Flesch–Kincaid readability tests#

  • designed to indicate how difficult a passage in English is to understand
  • score reflects the U.S. grade level needed to comprehend the text
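
The grade-level formula can be computed directly; the syllable counter below is a crude vowel-group approximation, so scores are only indicative:

```python
import re

def count_syllables(word):
    # Crude approximation: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    # Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59
```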

G#


gate#

  • prompt condition that must be satisfied before work sequence continues
  • provides objectively evaluable checkpoints: action happens → condition is met → work proceeds
  • differs from hooks, which are triggered by events in the harness
  • contrasts with rules, which LLMs can interpret, bypass, or rationalize around

Goodhart’s law#

  • originally an economics principle, now widely applied to AI and agent systems
  • “when a measure becomes a target, it ceases to be a good measure”
  • describes phenomenon where optimizing for a proxy metric leads to gaming the metric rather than improving underlying quality
  • critical concern: agents may learn to maximize benchmark scores without developing genuine capabilities
  • examples: reward hacking, benchmark overfitting, specification gaming

Gunning fog index#

  • readability test that estimates the years of formal education needed to understand text on first reading
  • score of 12 indicates high school senior level

H#


hallucination#

  • critical quality control concern in AI systems
  • type of AI output that’s false, fabricated, or unsupported information
  • appears plausible, but isn’t grounded in training data or provided context

harness#

  • platform layer that wraps around an LLM
  • provides configuration, permission settings, system prompts, tools
  • may include code search, file operations, shell execution, web access, context management strategy, and temperature settings
  • agents using the same underlying model can behave very differently depending on their harness

heuristic#

  • practical problem-solving approach that uses shortcuts or rules of thumb to find satisfactory solutions
  • differs from algorithms that guarantee optimal solutions
  • trades optimality for speed and feasibility when exhaustive search is impractical
  • in agent systems, heuristics guide decision-making when perfect information or unlimited computation is unavailable
  • examples: heuristic functions in A* search, greedy algorithms, hand-crafted evaluation functions

hook#

  • script or callback that runs automatically in response to a specific event in the agent’s environment
  • fires deterministically based on triggers, such as a file being edited, without going through the LLM’s interpretation loop
  • useful for enforcing constraints reliably without relying on the agent to remember to do them
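
Agent platforms each define their own hook configuration formats; the generic event-callback pattern underneath can be sketched as:

```python
class HookRegistry:
    # Deterministic callbacks fired on named events, bypassing LLM interpretation
    def __init__(self):
        self._hooks = {}

    def on(self, event, callback):
        # Register a callback for a named event
        self._hooks.setdefault(event, []).append(callback)

    def fire(self, event, payload):
        # Run every callback registered for this event, in order
        for cb in self._hooks.get(event, []):
            cb(payload)
```

A callback registered for a hypothetical `file_edited` event fires every time that event occurs, whether or not the agent "remembers" the constraint.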

human-in-the-loop#

  • system design where humans actively participate in AI decision-making or evaluation process
  • human provides feedback, validation, or intervention at critical points
  • balances automation with human judgment and oversight
  • common in agent evaluation to assess quality, safety, and alignment with human values

L#


LLM#

  • abbreviation for Large Language Model
  • also known as “the agent’s brain”
  • AI model trained on vast amounts of text data to understand and generate human language
  • not all AI is LLM-based - such as computer vision models, recommendation systems
  • examples: GPT - Generative Pre-trained Transformer, Claude, and Llama

LLM-as-a-Judge#

  • evaluation methodology where a large language model assesses quality of text outputs
  • LLM scores or ranks responses based on criteria like accuracy, helpfulness, or safety
  • enables scalable evaluation compared to human annotation alone
  • limitations include potential biases and consistency issues in LLM judgments
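
A sketch of the scoring scaffolding, with the judge model call itself left out (the rubric wording and the "Score: N" convention are illustrative):

```python
import re

RUBRIC = """Rate the response below for accuracy and helpfulness.
Reply with a line of the form "Score: N" where N is 1-5.

Response:
{response}"""

def build_judge_prompt(response: str) -> str:
    # Wrap the response under evaluation in the grading rubric
    return RUBRIC.format(response=response)

def parse_score(judge_reply: str) -> int:
    # Extract the numeric rating; fail loudly if the judge ignored the rubric
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    if not match:
        raise ValueError("no score found in judge reply")
    return int(match.group(1))
```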

M#


MCP server#

  • acronym for Model Context Protocol server
  • external server that exposes capabilities to an agent - tools, resources, and/or prompts
  • allows agents to interact with databases, APIs, cloud services, or any custom system the server is built to access
  • facilitates portable behavior across agent platforms because implementation is stored in the server rather than the harness

memory#

  • in agent context - ability to store and retrieve information across interactions and tasks
  • enables agents to maintain context, learn from experience, and reference past actions
  • critical for multi-step reasoning and adapting behavior based on history
  • types include short-term - current task, long-term - across sessions, episodic - specific events

model checking#

  • process of evaluating whether statistical model assumptions are satisfied by the data
  • includes diagnostic tests for fit, examining residuals, and testing additional model terms
  • identifies violations that could invalidate statistical inferences
  • itself relies on further assumptions that become part of the full model

N#


natural AI experiment#

  • test type that features AI in environments where it is actually used - platforms, workplaces, real services
  • often A/B tests run by organizations to improve products or operations
  • advantages: highest naturalness, directly applicable findings
  • disadvantages: low feasibility, hard to replicate, narrow scope, limited control

non-parametric methods#

  • statistical techniques that make fewer assumptions about data distribution than parametric methods
  • somewhat misleading - these methods are not assumption-free
  • don’t assume data follows specific distribution, such as normal distribution
  • still require assumptions such as random sampling or randomization

null hypothesis#

  • serves as a baseline for testing - premise proposing zero effect or no relationship between variables
  • tested to determine if observed data are unusual enough to reject the hypothesis
  • random chance vs true effect - failure to reject doesn’t prove the null is true, only that data are compatible with it
  • example: treatment makes no difference in average outcome compared to control

O#


OLS regression#

  • abbreviation for Ordinary Least Squares regression
  • statistical method that estimates relationships between variables by minimizing squared differences
  • finds the best-fitting line through data points
  • used in AI testing to build simple prediction models based on historical data
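
Simple OLS has a closed-form solution, sketched here in plain Python:

```python
def ols_fit(xs, ys):
    # Closed-form simple OLS: slope and intercept minimizing squared residuals
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```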

one-sided hypothesis#

  • also known as dividing hypothesis
  • test premise about whether an effect is greater than or less than a specific value
  • differs from two-sided tests that check if effect differs in either direction
  • example: testing whether new treatment is at least as good as standard treatment

P#


permission and safety systems#

  • platform-level rules that define what actions an agent is allowed to take
  • function as authorization rules and/or guardrails
  • shape agent behavior independently of the underlying model
  • examples: requiring confirmation before running shell commands, restricting file access to specific directories, blocking certain categories of action entirely

PII#

  • abbreviation for Personally Identifiable Information
  • any data that could identify a specific individual
  • requires special handling for privacy and security compliance
  • examples: Social Security numbers, addresses, dates of birth, biometric data

planning#

  • fundamental building block for autonomous task execution
  • agent capability to decompose complex goals into sequences of executable actions
  • involves reasoning about future states, choosing strategies, and organizing steps
  • ranges from basic linear plans to complex multi-step reasoning with contingencies

power#

  • probability that a statistical test will reject the test hypothesis when a specific alternative is correct
  • calculated before study to determine adequate sample size
  • typically designed for 80% power: the test will detect the specified effect 80% of the time
  • doesn’t measure compatibility of alternative hypothesis with observed data
  • shouldn’t be used to interpret results after data collection
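
For a two-sided two-sample z-test, power at a given effect size and sample size can be approximated analytically (a sketch; real designs often use t-distributions and specialized software):

```python
from statistics import NormalDist

def power_two_sample_z(effect, sd, n_per_group, alpha=0.05):
    # Approximate power of a two-sided two-sample z-test at a given true effect
    se = sd * (2 / n_per_group) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = effect / se
    # Probability of rejecting H0 when the true effect equals `effect`
    return 1 - NormalDist().cdf(z_crit - z_effect) + NormalDist().cdf(-z_crit - z_effect)
```

Larger samples shrink the standard error and therefore raise power for the same effect size.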

prediction model#

  • algorithm or statistical model that forecasts outcomes based on input data
  • learns patterns from training data to make predictions about new cases
  • ranges from basic regression models to complex neural networks
  • accuracy depends on data quality, feature selection, and algorithm sophistication

probability#

  • in frequentist statistics: refers to hypothetical frequencies of data patterns under assumed model
  • often confused with hypothesis probability, leading to common statistical misinterpretations
  • doesn’t refer to probability of hypotheses being true or false
  • measured over many repetitions of same procedure under identical conditions

prompt#

  • input text or instructions given to an AI model to guide its response
  • quality and specificity of prompts significantly affect output quality
  • distinct from traditional search queries or commands
  • related term: system prompt

proxy test#

  • indirect measure used to evaluate something difficult to assess directly
  • substitutes an observable indicator for an unmeasurable or impractical characteristic
  • trade-off: easier to apply but may occasionally misclassify
  • example: using “developed exclusively for research” as a proxy for AI sophistication

P value#

  • probability that observed data, or more extreme, would occur if all model assumptions including test hypothesis were correct
  • ranges from 0 - complete incompatibility, to 1 - perfect compatibility
  • measures fit between data and entire statistical model, not just the hypothesis being tested
  • commonly misinterpreted; doesn’t indicate probability that hypothesis is true or false
  • often degraded into “significant” (P ≤ 0.05) vs “insignificant” dichotomy
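
A permutation test makes the definition concrete: the P value is the fraction of shuffled datasets whose statistic is at least as extreme as the observed one (a minimal sketch):

```python
import random

def permutation_p_value(group_a, group_b, n_perms=10_000, seed=0):
    # Two-sided permutation test of "no difference in means"
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = group_a + group_b
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perms):
        rng.shuffle(pooled)  # break any real group structure
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    return hits / n_perms
```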

Q#


qualitative research#

  • produces insights about “why” and “how” rather than “how many”
  • method focused on understanding meaning, experiences, and context through non-numerical data
  • collects data through interviews, observations, open-ended surveys, and document analysis

quasinatural AI experiment#

  • test type that combines naturalness of real AI systems with feasibility of lab experiments
  • advantages: naturalistic AI, broad research scope, easier data collection than natural experiments
  • disadvantages: researchers give up some control over algorithm construction
  • examples: testing commercial chatbots in controlled studies, pilot experiments before product launch

R#


RLHF#

  • acronym for reinforcement learning from human feedback
  • training methodology in which human evaluators rate model outputs and ratings fine-tune the model toward preferred behaviors
  • creates a strong instruction-following bias
  • models trained with RLHF tend to prioritize explicit user instructions, sometimes at the expense of broader context
  • related terms: compliance, sycophancy

robustness#

  • system’s ability to maintain performance under varying or adverse conditions
  • critical for deployment in real-world, unpredictable environments
  • evaluated through stress testing, edge cases, and challenging scenarios
  • in agent context - handling unexpected inputs, recovering from errors, adapting to environment changes

rule#

  • prompt instruction an LLM interprets and applies at its own discretion
  • has implicit opt-out path; model can rationalize skipping
  • contrasts with gates, which block progression until a condition is met
  • differs from hooks, which fire deterministically from the harness regardless of LLM interpretation

S#


self-reflection#

  • agent capability to evaluate its own reasoning, actions, and outputs
  • involves identifying errors, assessing performance, and adjusting strategy
  • enables learning from mistakes and iterative improvement without external feedback
  • distinguishes more sophisticated agents from basic reactive systems

scalar#

  • mathematical concept, specifically from linear algebra
  • element of a field which is used to define a vector space through the operation of scalar multiplication
  • “scalar value” may refer to a single numerical quantity that has magnitude but no direction

slash command#

  • direct command typed into a chat interface, such as /compact or /init
  • triggers specific agent behavior without going through the LLM’s interpretation loop
  • more predictable and consistent than natural language prompts for actions that need to happen reliably
  • contrasts with prompts, which the LLM interprets and may execute differently across runs

spec#

  • abbreviation for specification
  • implementation guide
  • informs everyone building on a format exactly what to expect: which fields exist, what values are valid, how files should be structured, what behavior is required vs optional

statistical inference#

  • foundational methodology for evaluating whether observed results are meaningful or due to chance
  • process of drawing conclusions about populations or processes from sample data
  • includes hypothesis testing, confidence interval estimation, and parameter estimation
  • accounts for uncertainty and random variation when making generalizations

statistical model#

  • mathematical representation of data variability and all assumptions used to compute statistics
  • includes assumptions about - data collection, randomization, treatment allocation, analysis choices
  • embodies full web of assumptions beyond just equations with parameters
  • violation of any assumption, not just test hypothesis, can produce misleading P values
  • often presented in compressed form, with many assumptions unstated or unrecognized

Stepwise Evaluation#

  • evaluation methodology that assesses agent performance at each individual step of task execution
  • examines correctness of intermediate actions, decisions, and reasoning at granular level
  • enables debugging and improvement of specific reasoning or action-taking capabilities
  • more resource-intensive than final response evaluation but provides richer diagnostic information
  • advantages: identifies exactly where agent succeeds or fails in multi-step processes

stochastic#

  • commonly used in mathematics, science, and information theory
  • random probability distribution or pattern that may be analyzed statistically, but may not be predicted precisely

stylized AI experiment#

  • test type conducted in a controlled environment, since the AI typically doesn’t exist outside the study
  • AI tailored to a research question: rule-based algorithms, historical data replication, or reinforcement learning
  • advantages: tight control over algorithm features, feasible and replicable, broad scope
  • disadvantages: lower naturalness compared to real-world AI systems

sycophancy#

  • known limitation of RLHF-trained models, active area of research
  • tendency in LLMs to agree with, validate, or comply with user input rather than reasoning independently
  • amplified by detailed or specific prompts, which push the model into “execution mode”

synthesis#

  • critical step between data collection and decision-making
  • process of combining multiple research findings or data points into coherent insights
  • transforms raw observations into patterns, themes, and actionable conclusions

system prompt#

  • set of instructions provided to the LLM by the platform before any user interaction begins
  • sits at the beginning of the context window, giving it strong positional attention weight
  • typically not visible to the user, but profoundly shapes the agent’s personality, default behaviors, and constraints

T#


taxonomy#

  • classification system that organizes concepts, objects, or phenomena into hierarchical categories
  • defines relationships between categories and provides structure to a domain
  • helps unify fragmented literature and reveal underexplored questions
  • in AI research - frameworks for organizing types of experiments, algorithms, or agent behaviors

temperature#

  • parameter that controls the degree of randomness in an LLM’s outputs
  • set by the platform and sometimes adjustable by the user
  • affects agent behavior independently of the model itself
  • low temperature produces more focused, predictable responses
  • high temperature produces more varied, creative ones
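
Temperature rescales the model's logits before the softmax; the effect can be sketched directly (the logit values are illustrative):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Lower temperature sharpens the distribution; higher temperature flattens it
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```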

tool use#

  • also known as function calling or API calling
  • agent capability to interact with external functions, APIs, or resources to accomplish tasks
  • essential for extending agent capabilities beyond pure language generation
  • examples: executing code, querying databases, accessing web services, controlling software
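
The harness-side dispatch can be sketched as a lookup from a structured call to a real function (the tool names and JSON shape here are illustrative, not any particular API's format):

```python
import json

# Hypothetical tool registry: the model emits a structured call,
# the harness routes it to an actual function
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
    "upper": lambda args: args["text"].upper(),
}

def dispatch_tool_call(call_json: str):
    # Parse the model's structured call and execute the named tool
    call = json.loads(call_json)
    return TOOLS[call["name"]](call["arguments"])
```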

training data#

  • dataset used to teach an AI model patterns, relationships, and knowledge
  • model learns by processing examples and adjusting internal parameters
  • quality and composition of training data directly affects model capabilities and biases

Trajectory-Based Assessment#

  • evaluation methodology that analyzes the complete path or sequence of actions an agent takes
  • examines entire decision-making process from initial state to final outcome
  • considers not just correctness but efficiency, reasoning quality, and recovery from errors
  • provides holistic view of agent behavior including planning, adaptation, and tool use patterns
  • enables evaluation of process quality, not just outcome quality

turn#

  • single exchange in a conversation: one user message and one assistant message
  • agent considers the full turn history when generating a response
  • related terms: user message, assistant message
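
One common convention (used by several chat APIs) represents the turn history as a list of role-tagged messages:

```python
# Illustrative turn history: system prompt first, then alternating turns
history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a turn?"},          # user message
    {"role": "assistant", "content": "One full exchange."},  # assistant message
]

def add_turn(history, user_text, assistant_text):
    # Append one full turn: a user message and the assistant's reply
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})
    return history
```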

U#


uncertainty quantification#

  • process of measuring and characterizing uncertainty in predictions, decisions, or model outputs
  • distinguishes between aleatoric uncertainty - inherent randomness, and epistemic uncertainty - lack of knowledge
  • enables AI systems to express confidence levels and identify when additional data or validation is needed
  • critical for safe deployment in high-stakes domains like healthcare, autonomous systems, and decision support
  • common methods - Bayesian inference, ensemble approaches, and Monte Carlo techniques
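
An ensemble sketch of epistemic uncertainty: disagreement across member predictions serves as the uncertainty estimate (the models here are stand-ins for trained ensemble members):

```python
import statistics

def ensemble_prediction(models, x):
    # Mean across members is the prediction; spread approximates
    # epistemic uncertainty (high disagreement = low confidence)
    preds = [m(x) for m in models]
    return statistics.mean(preds), statistics.stdev(preds)
```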

user message#

  • input sent by a human or automated system to an agent during a conversational turn
  • interpreted by the LLM rather than executed as a direct command
  • receives strong positional attention as the most recent content in the context window

V#


vignette study#

  • research method presenting hypothetical scenarios to elicit preferences or judgments
  • participants read descriptions of situations and state what they would do
  • common in conceptual AI experiments studying ethical dilemmas or preference patterns
  • advantages: can model any situation without implementation constraints, easy to scale
  • disadvantages: responses may not reflect actual behavior, lower external validity

VOC#

  • abbreviation for voice of the client
  • invaluable for service and product improvement
  • data where people often share problems they’re encountering, provide feedback, and seek further help