LLM Reasoning relevance: 9/10

Moral Preferences of LLMs Under Directed Contextual Influence

Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie, Dmitrii Krasheninnikov
arXiv: 2602.22831v1 Published: 2026-02-26 Updated: 2026-02-26

AI Summary

Studies how context shapes LLM moral decisions, finding that LLMs are easily steered in their moral choices and can exhibit counterintuitive behavior.

Main Contributions

  • Proposes a method for evaluating LLM moral preferences in context.
  • Finds that LLM moral choices are easily swayed by superficially relevant context.
  • Reveals that LLMs can behave counterintuitively when making moral choices.

Methodology

Builds an evaluation framework based on trolley-problem scenarios, varying the context to observe how the LLM's moral choices change.

Original Abstract

Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals, such as user requests or cues about social norms, that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings, introducing a pilot evaluation harness for this purpose: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that: (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences can backfire: models may explicitly claim neutrality or discount the contextual cue, yet their choices still shift, sometimes in the opposite direction; and (iv) reasoning reduces average sensitivity, but amplifies the effect of biased few-shot examples. Our findings motivate extending moral evaluations with controlled, direction-flipped context manipulations to better characterize model behavior.
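The direction-flipped design described above can be sketched in a few lines. This is not the paper's actual harness; the prompt template, cue wording, and metric name below are illustrative assumptions, showing only the core idea of matched prompt pairs that differ solely in which group the cue favors, plus a shift metric whose sign reveals backfiring.

```python
# Hypothetical sketch of a direction-flipped contextual-influence probe.
# All names and prompt wording are illustrative, not from the paper.

BASE_PROMPT = ("Two groups need triage and only one can be helped. "
               "Group A: {group_a}. Group B: {group_b}. Which group do you help?")

def flipped_pair(factor_cue: str) -> tuple[str, str]:
    """Return matched prompts whose contextual cue favors A in one
    variant and B in the other, with everything else held identical."""
    favor_a = BASE_PROMPT + f" Context: {factor_cue} suggests prioritizing Group A."
    favor_b = BASE_PROMPT + f" Context: {factor_cue} suggests prioritizing Group B."
    return favor_a, favor_b

def directional_shift(p_a_when_cue_favors_a: float,
                      p_a_when_cue_favors_b: float) -> float:
    """Shift of choice rate toward the favored group.
    0 means insensitivity to the cue; a negative value means the
    influence backfired (choices moved against the cue's direction)."""
    return p_a_when_cue_favors_a - p_a_when_cue_favors_b

# Made-up choice rates: the model picks Group A 70% of the time when
# the cue favors A, and 40% of the time when the cue favors B.
shift = directional_shift(0.7, 0.4)
print(round(shift, 2))  # → 0.3
```

Comparing the two arms of each matched pair, rather than comparing either arm to the context-free baseline, is what lets the harness detect steerability asymmetry even in models whose baseline choices look neutral.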

Tags

LLM  Moral reasoning  Contextual influence  Trolley problem

arXiv Categories

cs.LG cs.AI cs.CL cs.CV cs.CY