Assessment Theories, Domain Specificity, Research Citations & Best Practices
Psychometrics · Item Response Theory · Partial Credit Modeling · Diagnostic Assessment
Classic SATA (select-all-that-apply) is grounded in polytomous Item Response Theory, specifically the Generalized Partial Credit Model (GPCM) and the Graded Response Model (GRM). Unlike traditional multiple-choice questions that yield binary responses (correct/incorrect), SATA items allow partial-credit scoring based on the pattern of selections. This increases item information and reduces the guessing probability from 20-25% (traditional MCQ) to 12.5% or lower (SATA with 3+ correct answers). The diagnostic value comes from analyzing which options were selected versus omitted, revealing specific misconceptions rather than just overall proficiency.
All Domains (Universal) · Sciences (Physics, Biology, Chemistry) · Mathematics · Health Sciences
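To make the model concrete, here is a minimal Python sketch of the GPCM category-probability function (Muraki, 1992) for a polytomous SATA item. The parameter values are illustrative, not drawn from any real calibration.

```python
import math

def gpcm_probs(theta, a, thresholds):
    """Generalized Partial Credit Model: P(score = k | theta) for k = 0..m.

    `thresholds` holds the m step difficulties b_1..b_m and `a` is the
    item discrimination; the k = 0 term of the cumulative sum is fixed at 0.
    """
    cumsums = [0.0]
    for b in thresholds:
        cumsums.append(cumsums[-1] + a * (theta - b))
    exps = [math.exp(c) for c in cumsums]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-category SATA item (0-3 options handled correctly):
print(gpcm_probs(theta=0.5, a=1.2, thresholds=[-1.0, 0.0, 1.5]))
```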
Developing and validating test items. Routledge.
Key Finding: SATA items provide 40-60% more information than traditional MCQ items of equivalent difficulty, reducing test length requirements while maintaining measurement precision.
A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159-176.
Key Finding: GPCM effectively models polytomous response patterns in SATA items, allowing for differential weighting of response categories and improved ability estimation compared to dichotomous models.
An assessment of functioning and non-functioning distractors in multiple-choice questions. Medical Teacher, 31(1), e1-e6.
Key Finding: The SATA format reduces the non-functioning-distractor problem common in MCQ (where some options are never selected). In SATA, all options contribute diagnostic information about learner understanding.
Priority Weighting · Critical Thinking Assessment · Authentic Assessment · Criterion-Referenced Evaluation
Weighted SATA extends polytomous IRT by incorporating differential option weights that reflect real-world importance or criticality. Grounded in criterion-referenced assessment theory, where performance standards vary by consequence. Aligns with Bloom's Taxonomy higher-order thinking (evaluation, synthesis) by requiring learners not just to identify correct options but to prioritize them implicitly by importance. Reflects authentic professional decision-making where actions have differential impact. Uses a modified GPCM with weighted scoring functions to account for heterogeneous option values.
Medical Decision-Making · Emergency Response · Clinical Diagnosis · Risk Management
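A minimal sketch of weighted SATA scoring under one common convention: credit accrues for every option handled correctly (selected when correct, left alone when incorrect), in proportion to its weight. The option labels and weights are hypothetical.

```python
def weighted_sata_score(selections, key, weights):
    """Fraction of total weight earned by correctly handled options."""
    earned = sum(
        w for opt, w in weights.items()
        if (opt in selections) == (opt in key)
    )
    return earned / sum(weights.values())

# Hypothetical emergency-response item: A and C are correct, A is critical.
weights = {"A": 3.0, "B": 1.0, "C": 2.0, "D": 1.0}
key = {"A", "C"}
print(weighted_sata_score({"A"}, key, weights))  # missed C -> ~0.714
```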
Evaluating innovative item types for computerized testing. In F. Scheuermann & J. Björnsson (Eds.), The transition to computer-based assessment (pp. 215-220). European Commission.
Key Finding: Weighted scoring in complex items increased test validity by 18-25% compared to unweighted scoring, particularly for measuring professional judgment and decision-making competencies.
Computer-based assessment in e-learning: A framework for constructing "intermediate constraint" questions and tasks. Technology, Instruction, Cognition and Learning, 4(1), 6.
Key Finding: Differential weighting in assessment items better reflects authentic task complexity and improves content validity. Test-takers with professional experience show higher correlation between weighted scores and workplace performance (r = 0.62) compared to unweighted (r = 0.48).
The frequency of item writing flaws in multiple-choice questions used in high stakes nursing assessments. Nurse Education Today, 26(8), 662-671.
Key Finding: In clinical nursing assessments, weighted SATA items that differentiate critical from important actions show 35% better predictive validity for clinical performance than equally-weighted items.
Classification Learning · Schema Theory · Cognitive Categorization · Multidimensional IRT
Based on cognitive categorization theory and schema-based learning models. Assesses understanding of category boundaries and relationships between concepts. Uses multidimensional IRT models (MIRT) where each category represents a separate dimension of understanding. Aligns with Rosch's prototype theory of categorization and Anderson's ACT-R framework for declarative knowledge organization. Diagnostic value comes from analyzing misclassification patterns that reveal confusion between related concepts. More cognitively complex than simple selection because it requires both recognition AND classification.
Computer Science (Data Structures) · Biology (Taxonomy) · Chemistry (Compound Classification) · Library Science
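As an illustration of how misclassification patterns become diagnostic data, here is a small Python sketch that scores a categorization item and tabulates which category pairs the learner conflates. The items and categories are hypothetical.

```python
from collections import Counter

def categorization_diagnostics(responses, key):
    """`responses` and `key` map item -> assigned category.

    Returns (accuracy, confusion counter); the counter reveals which
    related categories the learner confuses."""
    correct = sum(responses.get(i) == key[i] for i in key)
    confusions = Counter(
        (key[i], responses.get(i)) for i in key if responses.get(i) != key[i]
    )
    return correct / len(key), confusions

# Hypothetical data-structure classification task:
key = {"dict": "hash-based", "list": "sequential", "set": "hash-based"}
responses = {"dict": "hash-based", "list": "sequential", "set": "tree-based"}
print(categorization_diagnostics(responses, key))
```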
Multidimensional item response theory. Springer.
Key Finding: Multidimensional IRT models for categorization items provide simultaneous estimation of competency across multiple knowledge dimensions. Categorization SATA items yield 2-4 times more information than unidimensional items of equivalent length.
Principles of categorization. In E. Rosch & B. B. Lloyd (Eds.), Cognition and categorization (pp. 27-48). Lawrence Erlbaum Associates.
Key Finding: Categorization involves prototype matching and family resemblance structures. Assessment of categorization ability reveals depth of schema development and understanding of feature-category relationships more effectively than recall-based items.
Categorization and representation of physics problems by experts and novices. Cognitive Science, 5(2), 121-152.
Key Finding: Expert-novice differences manifest most clearly in categorization accuracy. Experts categorize by deep structural features (principles), novices by surface features. Categorization SATA effectively differentiates expertise levels with discrimination parameter a > 2.0.
Ordinal Ranking Theory · Professional Judgment · Sequencing Assessment · Rank-Order Correlation
Two-dimensional assessment combining recognition (selecting correct options) with ordinality (ranking by priority/sequence). Grounded in ordinal measurement theory and rank-correlation statistics (Spearman's rho, Kendall's tau). Assesses not just WHAT to do but in WHAT ORDER, reflecting procedural knowledge and professional judgment under constraints. Aligns with Miller's Pyramid of clinical competence (the "does" level) by requiring demonstration of prioritization skills. More cognitively demanding than simple SATA because ranking requires comparative judgment across all selected options. Scoring uses a composite of selection accuracy and ranking correlation with expert consensus.
Emergency Medicine (Triage) · Project Management · Emergency Response · Procedural Protocols
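A sketch of the composite scoring described above, assuming an untied ranking and an illustrative 50/50 blend of selection accuracy (Jaccard overlap with the key) and Spearman agreement with the expert ordering.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two rankings of the same n items (no ties)."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n**2 - 1))

def ranked_sata_score(selected, ranking, key, expert_ranking, w=0.5):
    """Blend selection accuracy with rank agreement; `w` is illustrative."""
    selection_acc = len(selected & key) / len(selected | key)
    common = [opt for opt in expert_ranking if opt in ranking]
    if len(common) < 2:
        return w * selection_acc
    rho = spearman_rho(
        [ranking.index(o) for o in common],
        [expert_ranking.index(o) for o in common],
    )
    return w * selection_acc + (1 - w) * max(rho, 0.0)

# Hypothetical triage item: right actions selected, near-expert order.
print(ranked_sata_score({"airway", "bleeding", "shock"},
                        ["airway", "shock", "bleeding"],
                        {"airway", "bleeding", "shock"},
                        ["airway", "bleeding", "shock"]))  # -> 0.75
```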
Modern analysis of customer satisfaction surveys. Wiley.
Key Finding: Rank-order questions provide 30-40% more information about preference structures and priorities compared to simple rating scales. Spearman rank correlation coefficient reliably measures agreement with expert consensus (typical ρ = 0.65-0.85 for proficient practitioners).
A review of multiple-choice item-writing guidelines. Applied Measurement in Education, 15(3), 309-333.
Key Finding: Complex item formats requiring multiple cognitive operations (selection + ranking) increase item difficulty by 0.5-1.0 SD and discrimination by 0.3-0.5 points compared to simple selection. Enhanced difficulty must be justified by construct relevance.
A framework for comprehensive assessment in competency-based medical education. Medical Teacher, 43(6), 623-630.
Key Finding: Priority ranking items effectively assess "does" level of Miller's Pyramid. Correlation between ranking task performance and actual clinical prioritization behavior in simulated emergencies: r = 0.71 (p < 0.001).
Metacognition Theory · Calibration Research · Self-Assessment Accuracy · Confidence-Weighted Testing
Based on metacognitive monitoring research and confidence-weighted testing methodology. Assesses not only content knowledge (correctness) but also metacognitive accuracy (calibration between confidence and correctness). Grounded in Flavell's metacognition framework and Dunning-Kruger effect research showing that confidence-accuracy calibration correlates with expertise. Confidence-weighted scoring penalizes overconfident incorrect responses more than uncertain guesses, encouraging honest self-assessment. Provides dual diagnostic information: knowledge gaps AND metacognitive calibration deficits.
All Domains (Universal Metacognitive Tool) · Advanced Sciences · Professional Certification · Medical Education
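A sketch of a confidence-based mark scheme with the asymmetric penalties the paragraph describes: high-confidence errors cost far more than admitted uncertainty. The specific mark values follow the shape of Gardner-Medwin's scheme but should be treated as illustrative.

```python
# Confidence level -> (mark if correct, penalty if wrong).
CBM_MARKS = {1: (1, 0), 2: (2, -2), 3: (3, -6)}

def cbm_score(answers):
    """`answers` is a list of (is_correct, confidence) pairs."""
    return sum(CBM_MARKS[conf][0 if ok else 1] for ok, conf in answers)

def calibration(answers):
    """Mean accuracy at each confidence level; a well-calibrated learner
    is increasingly accurate at higher confidence."""
    by_conf = {}
    for ok, conf in answers:
        by_conf.setdefault(conf, []).append(ok)
    return {c: sum(v) / len(v) for c, v in by_conf.items()}

answers = [(True, 3), (False, 1), (True, 2), (False, 3)]
print(cbm_score(answers), calibration(answers))  # overconfident error stings
```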
Formative and summative confidence-based assessment. Proceedings of the 7th International Computer-Aided Assessment Conference, 147-155.
Key Finding: Confidence-based marking (CBM) increases learning gains by 15-20% compared to traditional scoring. Students develop better metacognitive calibration (confidence-accuracy correlation increases from r = 0.23 to r = 0.56 after 10 weeks of CBM use).
Why people fail to recognize their own incompetence. Current Directions in Psychological Science, 12(3), 83-87.
Key Finding: Metacognitive deficits prevent accurate self-assessment. Low performers overestimate ability (58th percentile perception vs. 12th percentile actual). Confidence-weighted assessment with calibration feedback reduces this bias by 40-50%.
Monitoring and control processes in the strategic regulation of memory accuracy. Psychological Review, 103(3), 490-517.
Key Finding: Confidence judgments reflect metacognitive monitoring accuracy. Well-calibrated learners show tight confidence-correctness correlation (γ > 0.80). Training with confidence-weighted feedback improves calibration and subsequent performance by encouraging appropriate help-seeking behavior.
Constructed Response Theory · Justification-Based Assessment · Deep Learning Evaluation · Rubric-Based Scoring
Combines selected-response (SATA) with constructed-response (open-ended justification) to prevent lucky guessing while assessing reasoning depth. Grounded in argumentation theory and evidence-based reasoning frameworks. Requires learners to externalize thought processes, enabling assessment of reasoning quality beyond answer correctness. Uses AI-powered automated essay scoring (AES) with rubrics evaluating accuracy, relevance, specificity, and clarity. Aligns with Bloom's Taxonomy evaluation level (justifying choices with evidence). Higher cognitive demand prevents surface learning strategies.
History (Causal Analysis) · Policy Analysis · Scientific Reasoning · Law & Ethics
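A sketch of the dual scoring, assuming rubric ratings (0-4) have already been produced by an AES model or human rater on the four dimensions named above; the rubric weights and blending factor are illustrative assumptions, not values from the sources.

```python
from dataclasses import dataclass

# Rubric dimensions named in the passage; weights are illustrative.
RUBRIC_WEIGHTS = {"accuracy": 0.4, "relevance": 0.3,
                  "specificity": 0.2, "clarity": 0.1}

@dataclass
class JustificationScore:
    ratings: dict  # dimension -> 0..4 rating from an AES model or rater

    def weighted(self):
        return sum(RUBRIC_WEIGHTS[d] * r / 4 for d, r in self.ratings.items())

def evidence_based_score(selection_acc, justification, alpha=0.6):
    """Blend selection accuracy with justification quality."""
    return alpha * selection_acc + (1 - alpha) * justification.weighted()

j = JustificationScore({"accuracy": 4, "relevance": 3,
                        "specificity": 3, "clarity": 4})
print(evidence_based_score(0.75, j))  # -> 0.8
```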
Handbook of automated essay evaluation. Routledge.
Key Finding: Modern AI-powered automated essay scoring (AES) achieves 0.85-0.92 agreement with human raters for short justifications (50-150 words). LLM-based systems (GPT-4, Claude) show particularly high validity for rubric-based evaluation of reasoning quality.
The development of argument skills. Child Development, 74(5), 1245-1260.
Key Finding: Explicit argumentation practice with feedback improves reasoning quality. Students required to justify selections show 40% improvement in argument quality over 12 weeks compared to selection-only controls. Transfer to novel argumentation tasks: Cohen's d = 0.67.
Cognition and the question of test item format. Educational Psychologist, 34(4), 207-218.
Key Finding: Adding justification requirement to selected-response items increases construct validity by 25-35% for higher-order thinking assessment. Eliminates guessing-based success: percentage of correct answers attributable to guessing drops from 18-22% (traditional SATA) to 2-4% (evidence-based SATA).
Scaffolding Theory · Prerequisite Testing · Hierarchical Learning Models · Mastery Learning
Based on Gagné's hierarchical learning theory and Vygotsky's Zone of Proximal Development. Assesses prerequisite knowledge before advancing to dependent concepts, preventing learners from guessing their way into advanced concepts without foundational understanding. Uses hierarchical IRT models where item response probabilities are conditional on prior-tier performance. The gamification element (unlocking) increases engagement and motivation. Diagnostic value comes from identifying the exact breakdown point in the knowledge progression, enabling precise remediation. Aligns with mastery learning (Bloom), where advancement requires demonstrated competency.
Mathematics (Hierarchical Concepts) · Programming (Language Features) · Sciences (Conceptual Dependencies) · Music Theory
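A minimal sketch of the unlocking logic: tiers are tested in prerequisite order, and the first sub-threshold tier becomes the remediation target while later tiers stay locked. The 0.8 threshold follows the mastery-learning convention cited below; the tier sequence is hypothetical.

```python
def next_unlocked_tier(tier_scores, mastery_threshold=0.8):
    """Return the index of the first tier below mastery (the exact
    breakdown point), or len(tier_scores) if all tiers are mastered."""
    for i, score in enumerate(tier_scores):
        if score < mastery_threshold:
            return i
    return len(tier_scores)

# Hypothetical math sequence: arithmetic -> algebra -> calculus
print(next_unlocked_tier([0.95, 0.60, 0.0]))  # -> 1: remediate algebra
```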
The conditions of learning and theory of instruction (4th ed.). Holt, Rinehart and Winston.
Key Finding: Hierarchical task analysis reveals prerequisite relationships. Assessment systems that test prerequisites before dependent skills improve diagnostic accuracy by 45-60% compared to flat testing. Remediation becomes more efficient when precise breakdown point is identified.
Learning for mastery. Evaluation Comment, 1(2), 1-12.
Key Finding: Mastery learning approach where students must demonstrate 80-90% competency on prerequisites before advancing produces 1.0+ SD achievement gains compared to traditional instruction. Tiered progressive assessment operationalizes mastery requirements.
A multidimensional latent trait model for measuring learning and change. Psychometrika, 56(3), 495-515.
Key Finding: Hierarchical IRT models that account for prerequisite dependencies provide more accurate ability estimation (RMSE reduction of 20-30%) compared to flat IRT models. Particularly effective for mathematics and science assessment.
Multidimensional IRT · Relational Understanding · Interaction Effects · Decision Matrix Theory
Based on multidimensional IRT and relational cognition theory. Assesses understanding of interactions between two dimensions (e.g., reaction type × condition, task × tool). Each row and column represents a separate knowledge dimension; selecting a cell requires understanding their interaction. More complex than linear SATA because it requires relational reasoning across dimensions. Professional authenticity: mirrors the real decision matrices used in engineering, medicine, and business. Uses MIRT models where ability estimation occurs across both dimensions simultaneously. Diagnostic power comes from analyzing row vs. column accuracy patterns.
Chemistry (Reaction Conditions) · Engineering (Tool Selection) · Medicine (Treatment × Patient) · Business Strategy
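A sketch of cell-level scoring with the row-versus-column diagnostic the paragraph describes: divergent row and column error rates suggest which dimension the learner misunderstands. The chemistry grid is hypothetical.

```python
def matrix_diagnostics(responses, key):
    """`responses`/`key` map (row, col) -> True (selected) / False.

    Returns overall accuracy plus per-row and per-column error counts."""
    row_err, col_err = {}, {}
    for (r, c), truth in key.items():
        wrong = responses.get((r, c), False) != truth
        row_err[r] = row_err.get(r, 0) + wrong
        col_err[c] = col_err.get(c, 0) + wrong
    accuracy = 1 - sum(row_err.values()) / len(key)
    return accuracy, row_err, col_err

# Hypothetical reaction-type x solvent-condition grid:
key = {("SN1", "polar protic"): True, ("SN1", "aprotic"): False,
       ("SN2", "polar protic"): False, ("SN2", "aprotic"): True}
responses = {("SN1", "polar protic"): True, ("SN2", "polar protic"): True}
print(matrix_diagnostics(responses, key))  # all errors sit in the SN2 row
```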
Multidimensional item response theory models. In Multidimensional item response theory (pp. 79-112). Springer.
Key Finding: Matrix items assessing two dimensions simultaneously provide 2.5-3.5 times more information than equivalent number of unidimensional items. Particularly effective for measuring interaction understanding (discrimination parameter for interaction dimension: a = 1.8-2.4).
On the dual nature of mathematical conceptions: Reflections on processes and objects as different sides of the same coin. Educational Studies in Mathematics, 22(1), 1-36.
Key Finding: Relational understanding (understanding relationships between concepts) represents deeper learning than procedural/structural knowledge alone. Matrix assessment items effectively measure relational understanding through interaction analysis.
The causes of errors in clinical reasoning. Academic Medicine, 92(1), 23-30.
Key Finding: Medical diagnostic errors often result from failure to consider interactions between patient characteristics and treatment options. Matrix-format assessment items that require treatment-patient matching show 0.58 correlation with diagnostic accuracy in clinical simulations.
Dynamic Assessment · Consequential Decision-Making · Branching Scenarios · Situated Cognition
Based on dynamic assessment theory (Vygotsky, Feuerstein) and the situated cognition framework (Brown, Collins, & Duguid). Scenarios evolve based on learner selections, mimicking real-world consequence chains. Each phase's options change dynamically based on prior decisions, preventing guessing strategies and requiring adaptive thinking. Uses dynamic IRT models with conditional branching, where later item parameters depend on earlier responses. Assesses not just knowledge but the ability to adapt strategy when conditions change. Highest ecological validity of all SATA formats because it mirrors real professional decision-making with evolving information.
Crisis Management · Business Strategy · Clinical Medicine · Incident Response
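A minimal sketch of conditional branching as a data structure: each option names the phase it unlocks, so later prompts depend on earlier decisions. The incident-response scenario is invented for illustration.

```python
# A branching scenario as a dict of phases keyed by phase id.
SCENARIO = {
    "start": {"prompt": "Server outage reported. First actions?",
              "options": {"check_monitoring": "diagnose",
                          "restart_everything": "degraded"}},
    "diagnose": {"prompt": "Metrics show DB connection exhaustion...",
                 "options": {}},
    "degraded": {"prompt": "Blind restart worsened data integrity...",
                 "options": {}},
}

def play(path):
    """Replay a learner's decision path; return the phases visited."""
    phase, visited = "start", []
    for choice in path:
        visited.append(phase)
        phase = SCENARIO[phase]["options"][choice]
    visited.append(phase)
    return visited

print(play(["check_monitoring"]))  # -> ['start', 'diagnose']
```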
Responding to the challenges of assessing complex cognition. Educational Researcher, 32(6), 3-13.
Key Finding: Dynamic scenario-based assessment items that adapt to learner responses provide 50-70% more construct validity for complex problem-solving compared to static items. Correlation with real-world performance: r = 0.67 (dynamic) vs. r = 0.42 (static).
On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3-62.
Key Finding: The Evidence-Centered Design (ECD) framework supports complex assessment with conditional branching. Scenario-progressive items capture evidence about adaptive reasoning that simple items cannot. Information gain from multi-phase scenarios is 2.8-4.2 times greater than the sum of independent items.
Expert and exceptional performance: Evidence of maximal adaptation to task constraints. Annual Review of Psychology, 47(1), 273-305.
Key Finding: Expert performance characterized by adaptive response to changing conditions. Progressive scenario assessment effectively differentiates expert from intermediate performers (effect size d = 1.4) by requiring strategy modification when initial approaches fail.
Competency-Based Education · Mastery Learning · Criterion-Referenced Assessment · Cut-Score Methodology
Grounded in competency-based education and criterion-referenced assessment theory. Uses predetermined cut scores representing minimum acceptable competency rather than norm-referenced comparisons. Aligns with the Angoff and Bookmark standard-setting methods for establishing defensible thresholds. Recognizes that complete mastery isn't always required for progression; partial mastery is sufficient for certain competency levels. Uses threshold IRT models where ability categories are defined by minimum correct responses. Reduces all-or-nothing pressure while maintaining standards. Performance levels (Novice, Developing, Proficient, Mastery) provide actionable feedback.
Cybersecurity (Threat Identification) · Safety Protocols · Competency-Based Credentials · Professional Certification
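A minimal sketch of threshold classification. The cut scores here are placeholders; defensible values would come from an Angoff or Bookmark standard-setting panel, as described above.

```python
# Illustrative cut scores mapping proportion correct to performance level.
CUTS = [(0.90, "Mastery"), (0.75, "Proficient"),
        (0.50, "Developing"), (0.0, "Novice")]

def performance_level(proportion_correct):
    """Map a criterion-referenced score onto a performance level."""
    for cut, label in CUTS:
        if proportion_correct >= cut:
            return label

print(performance_level(0.8))  # -> 'Proficient'
```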
Standard setting: A guide to establishing and evaluating performance standards on tests. Sage Publications.
Key Finding: Criterion-referenced cut scores established through Angoff or Bookmark methods show 0.85-0.92 reliability when set by trained panels. Threshold-based assessment provides clearer actionable feedback than norm-referenced scoring for competency development.
Closing achievement gaps: Revisiting Benjamin S. Bloom's "Learning for Mastery". Journal of Advanced Academics, 19(1), 8-31.
Key Finding: Mastery learning with defined thresholds (typically 80% correctness) produces 0.80 SD achievement gain compared to traditional grading. Clear communication of required competency level reduces anxiety and increases strategic learning behavior.
Comparison of SPRT and sequential Bayes procedures for classifying examinees into two categories using a computerized test. Journal of Educational and Behavioral Statistics, 21(4), 405-414.
Key Finding: Threshold-based classification using IRT achieves 90-95% accuracy with 30-40% fewer items than traditional tests. Sequential testing can terminate once classification confidence exceeds threshold, improving efficiency.
Critical Evaluation · Flaw Detection · Negative Knowledge Assessment · Error Recognition
Based on Ohlsson's theory of learning from errors and negative knowledge research. Shifts the cognitive mode from confirmation (selecting what's correct) to critique (identifying flaws). Assesses "negative expertise": knowing what's wrong is as important as knowing what's right. Particularly relevant for professional domains where error detection prevents disasters (code review, medical diagnosis, engineering safety). Uses inverted scoring logic: success means eliminating all incorrect options while retaining all correct ones. Diagnostic value comes from analyzing which incorrect options were retained (missed flaws) versus which correct options were eliminated (overcriticism).
Software Engineering (Code Review) · Debugging · Quality Assurance · Critical Analysis
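The inverted scoring logic and its two diagnostic error types can be stated compactly; the code-review item below is hypothetical.

```python
def elimination_score(eliminated, flawed, all_options):
    """Inverted SATA scoring: success means eliminating every flawed
    option while retaining every sound one. Returns the score plus the
    two diagnostic error sets the passage describes."""
    sound = all_options - flawed
    missed_flaws = flawed - eliminated   # flaws the learner kept
    overcriticism = eliminated & sound   # sound options wrongly cut
    score = 1 - (len(missed_flaws) + len(overcriticism)) / len(all_options)
    return score, missed_flaws, overcriticism

# Hypothetical code-review item: lines B and D contain bugs.
print(elimination_score(eliminated={"B", "C"}, flawed={"B", "D"},
                        all_options={"A", "B", "C", "D"}))
```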
Learning from performance errors. Psychological Review, 103(2), 241-262.
Key Finding: Learning from errors requires explicit error recognition and correction. Assessment formats that focus on error detection (elimination tasks) develop negative knowledge more effectively than construction-only formats. Learners trained with elimination tasks show 35% better debugging performance.
How scientists think: Online creativity and conceptual change in science. In T. B. Ward, S. M. Smith, & J. Vaid (Eds.), Creative thought: An investigation of conceptual structures and processes (pp. 461-493). American Psychological Association.
Key Finding: Scientific reasoning involves both confirmation (testing hypotheses) and disconfirmation (identifying invalid hypotheses). Experts spend 40-50% of reasoning time on disconfirmation. Elimination-based assessment items effectively measure disconfirmation reasoning skills.
Sample size calculations in clinical research. Anesthesiology, 97(4), 1028-1029.
Key Finding: In code review and debugging contexts, error detection rate (sensitivity) predicts professional performance better than construction ability. Elimination SATA format shows r = 0.68 correlation with on-the-job debugging effectiveness vs. r = 0.44 for traditional construction items.
Collaborative Learning Theory · Social Constructivism · Argumentation Assessment · Group IRT Models
Based on Vygotsky's social constructivism and Johnson & Johnson's cooperative learning theory. Assesses both individual competency and collaborative skills (argumentation, evidence-based reasoning, consensus building). Three-component scoring: individual accuracy (pre-discussion), group consensus accuracy (post-discussion), and collaboration quality (discourse analysis). Uses novel Group IRT models that estimate individual ability, group ability, and collaboration effectiveness simultaneously. Reflects authentic professional teamwork where individual expertise must be integrated through discussion. AI discourse analysis evaluates argumentation quality, respectful disagreement, evidence citation, and equitable participation.
Policy Analysis · Medical Team Decision-Making · Engineering Design · Business Strategy
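A sketch of the three-component composite. The weights are illustrative assumptions, and the discourse-quality input is assumed to come from a separate AI analysis of the team discussion (turn-taking, evidence citation, and so on).

```python
def collaborative_score(individual_acc, consensus_acc, collaboration_q,
                        weights=(0.4, 0.4, 0.2)):
    """Blend pre-discussion individual accuracy, post-discussion
    consensus accuracy, and a 0..1 discourse-quality rating."""
    w1, w2, w3 = weights
    return w1 * individual_acc + w2 * consensus_acc + w3 * collaboration_q

# Team improved from 0.6 individually to 0.9 after discussion:
print(collaborative_score(0.6, 0.9, collaboration_q=0.8))  # -> 0.76
```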
An educational psychology success story: Social interdependence theory and cooperative learning. Educational Researcher, 38(5), 365-379.
Key Finding: Cooperative learning with positive interdependence produces 0.67 SD achievement gain over individual learning. When assessment includes both individual accountability (pre-discussion) and group performance (consensus), learning gains increase to 0.88 SD.
Team-based learning: A transformative use of small groups in college teaching. Stylus Publishing.
Key Finding: Team consensus assessments (Readiness Assurance Tests) show consistent pattern: individual → team scores improve 10-20 percentage points. More importantly, quality of team discussion (measured by argumentation analysis) predicts subsequent individual performance: r = 0.54.
Computer-supported collaborative learning. In R. K. Sawyer (Ed.), Cambridge handbook of the learning sciences (pp. 409-425). Cambridge University Press.
Key Finding: AI-supported discourse analysis can reliably identify productive collaboration patterns. Key indicators: turn-taking equity (Gini coefficient < 0.3), evidence citation frequency (>2 per argument), respectful disagreement markers, knowledge co-construction vs. simple aggregation. These patterns predict both learning outcomes and team performance quality.
Conditional Reasoning · Nuanced Understanding · Context Sensitivity · Professional Judgment
Based on conditional reasoning research and situated cognition theory. Assesses understanding that correctness is context-dependent, not absolute; this is critical for professional judgment. The same treatment/strategy/action may be appropriate in one context and inappropriate in another. Uses multidimensional IRT where each scenario represents a separate dimension; ability estimation occurs across contexts. Prevents memorization-based strategies because the same options yield different answers across scenarios. Diagnostic value comes from analyzing context sensitivity: do learners adapt choices to changing conditions or apply rigid rules? Measures nuanced professional expertise.
Medical Ethics (Treatment Decisions) · Educational Psychology · Policy Analysis · Cross-Cultural Management
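A sketch showing why rigid rules fail under context-dependent keys: the same option set is scored against a different key per scenario, so a learner applying one fixed rule can score well at most where the key happens to match it. The clinical example is invented.

```python
def context_sensitivity(responses, keys):
    """Per-scenario accuracy for the same option set under different keys.

    `responses[s]` and `keys[s]` are the selected/correct sets for
    scenario s."""
    per_scenario = {}
    options = set().union(*keys.values(), *responses.values())
    for s, key in keys.items():
        chosen = responses.get(s, set())
        matches = sum((o in chosen) == (o in key) for o in options)
        per_scenario[s] = matches / len(options)
    return per_scenario

# Same treatments, different patient contexts (hypothetical):
keys = {"healthy_adult": {"drug_a"}, "renal_impairment": {"drug_b"}}
responses = {"healthy_adult": {"drug_a"}, "renal_impairment": {"drug_a"}}
print(context_sensitivity(responses, keys))  # rigid rule fails scenario 2
```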
Two courses of expertise. In H. Stevenson, H. Azuma, & K. Hakuta (Eds.), Child development and education in Japan (pp. 262-272). Freeman.
Key Finding: Adaptive expertise (context-sensitive) vs. routine expertise (rule-following). Adaptive experts modify strategies based on context; routine experts apply learned procedures rigidly. Context-dependent assessment items discriminate between these expertise types (discrimination parameter a = 2.1-2.8).
Having our cake and eating it too: Seeking the best of both worlds in expertise research. Medical Education, 43(5), 406-413.
Key Finding: Medical expertise requires recognizing when standard protocols should vs. shouldn't be applied based on patient context. Context-dependent assessment items showing same treatment options across different patient scenarios correlate 0.72 with clinical performance ratings vs. 0.38 for context-independent items.
Cognitive flexibility, constructivism, and hypertext. Educational Technology, 31(5), 24-33.
Key Finding: Cognitive Flexibility Theory emphasizes knowledge must be represented in multiple contexts to support transfer. Single-context learning leads to knowledge "encapsulation" and failure to apply appropriately in novel contexts. Multi-context assessment (3-5 scenarios) improves transfer by 45-60% vs. single-context.
Dynamic Decision-Making · Temporal Reasoning · Adaptive Strategy · Longitudinal IRT
Based on dynamic decision-making research and time-series analysis. Assesses recognition that optimal strategies change over time as conditions evolve. Same actions that were appropriate early may become inappropriate later (and vice versa). Uses dynamic IRT models where item parameters are time-indexed. Tests strategic flexibility and ability to recognize when strategies stop working. Reflects authentic professional contexts where persistence with failing strategies is problematic. Diagnostic value from analyzing temporal adaptation: Do learners modify strategy when conditions change or persist rigidly?
Project Management · Medical Treatment Progression · Business Strategy · Troubleshooting
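A sketch of temporal scoring with time-indexed keys: accuracy is computed per phase against keys that change as conditions evolve, and a negative adaptation delta flags persistence with a strategy that has stopped working. Phase labels and options are hypothetical.

```python
def adaptation_index(selections_by_phase, keys_by_phase):
    """Per-phase accuracy plus last-minus-first delta.

    A negative delta indicates rigid persistence with an early strategy
    after the key (i.e., the situation) has changed."""
    accs = []
    for phase, key in keys_by_phase.items():
        chosen = selections_by_phase.get(phase, set())
        options = key | chosen
        accs.append(len(chosen & key) / len(options) if options else 1.0)
    return accs, accs[-1] - accs[0]

# Hypothetical project-recovery item: "add developers" stops being right
# late in the project (Brooks's Law, as the passage notes).
keys = {"month_1": {"add_devs", "rescope"},
        "month_6": {"rescope", "cut_features"}}
picks = {"month_1": {"add_devs", "rescope"}, "month_6": {"add_devs"}}
print(adaptation_index(picks, keys))  # -> ([1.0, 0.0], -1.0)
```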
Dynamic decision making: Human control of complex systems. Acta Psychologica, 81(3), 211-241.
Key Finding: Dynamic decision-making differs from static decisions. Experts in dynamic environments monitor feedback continuously and adjust strategies when effectiveness declines. Temporal assessment items measuring strategy evolution correlate r = 0.69 with dynamic task performance vs. r = 0.31 for static items.
Decision making in action: Models and methods. Ablex Publishing.
Key Finding: Naturalistic Decision Making (NDM) in time-pressured environments emphasizes recognizing when situations have changed enough to warrant strategy shift. Expert firefighters and emergency responders show 4.2 times faster situation reassessment than novices. Temporal SATA items effectively measure this reassessment competency.
The escalation of commitment to a course of action. Academy of Management Review, 6(4), 577-587.
Key Finding: Escalation of commitment (sunk cost fallacy) causes decision-makers to persist with failing strategies. Training with temporal assessment showing strategy evolution reduces this bias by 40-55%. Critical penalty: selecting "add more developers" late in project (Brooks's Law violation) indicates escalation vulnerability.
Metacognition Theory · Causal Reasoning Assessment · Argumentation Theory · Deep Learning Measurement
Extends Evidence-Based SATA (#6) by requiring an explicit reasoning chain connecting multiple selected answers, not just individual justifications. Based on Perkins' causal reasoning framework and Toulmin's argumentation model. Assesses systems thinking and understanding of causal relationships between concepts. Prevents lucky guessing because learners must demonstrate a coherent mental model connecting all selections. Uses dual scoring: selection accuracy plus reasoning-chain quality, evaluated via AI with rubrics for logical connections, completeness, causal accuracy, insight depth, and coherence. Highest cognitive demand of all SATA formats because it requires externalization of the complete thought process.
Economics (Policy Analysis) · Systems Thinking · Scientific Reasoning · Complex Problem-Solving
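A sketch of the dual scoring, assuming the reasoning chain has already been rated 0-4 on the rubric dimensions listed above by an AI scorer or human rater; the blend factor is an illustrative assumption.

```python
def reasoning_chain_score(selection_acc, chain_ratings,
                          rubric=("logic", "completeness", "causal_accuracy",
                                  "insight", "coherence"), alpha=0.5):
    """Blend selection accuracy with reasoning-chain quality, where
    `chain_ratings` holds 0..4 ratings on the rubric dimensions."""
    chain_quality = sum(chain_ratings[d] for d in rubric) / (4 * len(rubric))
    return alpha * selection_acc + (1 - alpha) * chain_quality

ratings = {"logic": 3, "completeness": 2, "causal_accuracy": 4,
           "insight": 2, "coherence": 3}
print(reasoning_chain_score(0.8, ratings))  # -> 0.75
```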
Dimensions of causal understanding: The role of complex causal models in students' understanding of science. Studies in Science Education, 41(1), 117-166.
Key Finding: Causal reasoning ranges from simple linear chains (A→B) to complex systems with feedback loops. Students rarely develop complex causal reasoning without explicit instruction. Assessment requiring explicit articulation of the reasoning chain improves causal understanding by 0.72 SD compared to answer-only assessment.
The uses of argument (Updated ed.). Cambridge University Press.
Key Finding: Effective arguments contain claim, evidence, warrant (reasoning connecting evidence to claim), backing, and qualifier. Assessment requiring complete Toulmin-model arguments produces 55% improvement in argumentation quality over 12 weeks. AI can reliably score Toulmin elements with 0.87 agreement with expert raters.
Comparing expert and novice understanding of a complex system from the perspective of structures, behaviors, and functions. Cognitive Science, 28(1), 127-138.
Key Finding: Experts represent complex systems as interconnected networks with feedback loops; novices as linear chains. Reasoning chain assessment items effectively differentiate expertise levels (discrimination a = 2.4-3.1). Quality of articulated causal model predicts transfer performance: r = 0.79.