Research — Ben Levinstein

Published & Forthcoming

Standards for Belief Representations in LLMs +

(with Daniel Herrmann)

Minds & Machines (forthcoming)

As large language models (LLMs) continue to demonstrate remarkable abilities across various domains, computer scientists are developing methods to understand their cognitive processes, particularly concerning how (and if) LLMs internally represent their beliefs about the world. However, this field currently lacks a unified theoretical foundation to underpin the study of belief in LLMs. This article begins filling this gap by proposing adequacy conditions for a representation in an LLM to count as belief-like. We argue that, while the project of belief measurement in LLMs shares striking features with belief measurement as carried out in decision theory and formal epistemology, it also differs in ways that should change how we measure belief. Thus, drawing from insights in philosophy and contemporary practices of machine learning, we establish four criteria that balance theoretical considerations with practical constraints. Our proposed criteria include accuracy, coherence, uniformity, and use, which together help lay the groundwork for a comprehensive understanding of belief representation in LLMs. We draw on empirical work showing the limitations of using various criteria in isolation to identify belief representations.

Bigger, Badder Bugs +

(with Jack Spencer)

Mind (forthcoming)

In this paper we motivate the ‘principles of trust’, chance-credence principles that are strictly stronger than the New Principle yet strictly weaker than the Principal Principle, and argue, by proving some limitative results, that the principles of trust conflict with Humean Supervenience.

Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks +

(with Daniel Herrmann)

Philosophical Studies (2024)

We consider the questions of whether or not large language models (LLMs) have beliefs, and, if they do, how we might measure them. First, we evaluate two existing approaches, one due to Azaria and Mitchell (2023) and the other to Burns et al. (2022). We provide empirical results that show that these methods fail to generalize in very basic ways. We then argue that, even if LLMs have beliefs, these methods are unlikely to be successful for conceptual reasons. Thus, there is still no lie-detector for LLMs. After describing our empirical results we take a step back and consider whether or not we should expect LLMs to have something like beliefs in the first place. We consider some recent arguments aiming to show that LLMs cannot have beliefs. We show that these arguments are misguided. We provide a more productive framing of questions surrounding the status of beliefs in LLMs, and highlight the empirical nature of the problem. We conclude by suggesting some concrete paths for future work.

Decision Theory without Luminosity +

(with Yoaav Isaacs)

Mind (2024)

Our decision-theoretic states are not luminous. We are imperfectly reliable at identifying our own credences, utilities, and available acts, and thus can never be more than imperfectly reliable at identifying the prescriptions of decision theory. The lack of luminosity affords decision theory a remarkable opportunity---to issue guidance on the basis of epistemically inaccessible facts. We show how a decision theory can guarantee action in accordance with contingent truths about which an agent is arbitrarily uncertain. It may seem that such advantages would require dubiously adverting to externalist facts that go beyond the internalism of traditional decision theory, but this is not so. Using only the standard repertoire of decision-theoretic tools, we show how to modify existing decision theories to take advantage of this opportunity.
These improved decision theories require agents to maximize conditional expected utility—expected utility conditional upon an agent’s actual decision-situation. We call such modified decision theories “self-confident”. These self-confident decision theories have a distinct advantage over standard decision theories---their prescriptions are better.

Evidential Decision Theory and the Ostrich +

(with Yoaav Isaacs)

Philosophers' Imprint (2024)

Evidential Decision Theory is flawed, but its flaws are not fully understood. David Lewis (1981) famously charged that EDT recommends an irrational policy of managing the news and “commends the ostrich as rational”. Lewis was right, but the case he appealed to—Newcomb’s Problem—does not demonstrate his conclusion. Indeed, decision theories other than EDT, such as Committal Decision Theory and Functional Decision Theory, agree with EDT's verdicts in Newcomb’s Problem, but their flaws, whatever they may be, do not stem from any ostrich-like recommendations. We offer a new case which shows that EDT mismanages the news, thus vindicating Lewis’s original charge. We argue that this case reveals a flaw in the “Why ain’cha rich?” defense of EDT. We argue further that this case is an advance on extant putative counterexamples to EDT.

Probability and Informed Consent +

(with Nir ben Moshe and Jonathan Livengood)

Theoretical Medicine and Bioethics (2023)

In this paper, we illustrate some serious difficulties involved in conveying information about uncertain risks and securing informed consent for risky interventions in a clinical setting. We argue that in order to secure informed consent for a medical intervention, physicians often need to do more than report a bare, numerical probability value. When probabilities are given, securing informed consent generally requires communicating how probability expressions are to be interpreted and communicating something about the quality and quantity of the evidence for the probabilities reported. Patients may also require guidance on how probability claims may or may not be relevant to their decisions, and physicians should be ready to help patients understand these issues.

Accuracy, Deference, and Chance +

The Philosophical Review (2023)

Chance both guides our credences and is an objective feature of the world. How and why we should conform our credences to chance depends on the underlying metaphysical account of what chance is. I use considerations of accuracy (how close your credences come to truth-values) to propose a new way of deferring to chance. The principle I endorse, called the Trust Principle, requires chance to be a good guide to the world, permits modest chances, tells us how to listen to chance even when the chances are modest, and entails but is not entailed by the New Principle. As I show, a rational agent will obey this principle if and only if she expects chance to be at least as accurate as she is on every good way of measuring accuracy. Much of the discussion, and the technical results, extend beyond chance to deference to any kind of expert. Indeed, you will trust someone about a particular question just in case you expect that person to be more accurate than you are about that question.

Deference Done Better +

(with Kevin Dorst, Bernhard Salow, Brooke E. Husic, & Branden Fitelson)

Philosophical Perspectives (2021)

There are many things—call them ‘experts’—that you should defer to in forming your opinions. The trouble is, many experts are modest: they’re less than certain that they are worthy of deference. When this happens, the standard theories of deference break down: the most popular (“Reflection”-style) principles collapse to inconsistency, while their most popular (“New-Reflection”-style) variants allow you to defer to someone while regarding them as an anti-expert. We propose a middle way: deferring to someone involves preferring to make any decision using their opinions instead of your own. In a slogan, deferring opinions is deferring decisions. Generalizing the proposal of Dorst (2020a), we first formulate a new principle that shows exactly how your opinions must relate to an expert’s for this to be so. We then build off the results of Levinstein (2019) and Campbell-Moore (2020) to show that this principle is also equivalent to the constraint that you must always expect the expert’s estimates to be more accurate than your own. Finally, we characterize the conditions an expert’s opinions must meet to be worthy of deference in this sense, showing how they sit naturally between the too-strong constraints of Reflection and the too-weak constraints of New Reflection.

Strict Propriety Is Weak +

(with Catrin Campbell-Moore )

Analysis (2021)

Considerations of accuracy – the epistemic good of having credences close to truth-values – have led to the justification of a host of epistemic norms. These arguments rely on specific ways of measuring accuracy. In particular, the accuracy measure should be strictly proper. However, the main argument for strict propriety supports only weak propriety. But strict propriety follows from weak propriety given strict truth directedness (which is non-negotiable) and additivity (which is both very common and plausible). So no further argument is necessary.

Cheating Death in Damascus +

(with Nate Soares)

The Journal of Philosophy (2020)

Evidential and Causal Decision Theory are the leading contenders as theories of rational action, but both face fatal counterexamples. We present some new counterexamples, including one in which the optimal action is causally dominated. We also present a novel decision theory, Functional Decision Theory (FDT), which simultaneously solves both sets of counterexamples. Instead of considering which physical action of theirs would give rise to the best outcomes, FDT agents consider which output of their decision function would give rise to the best outcome. This theory relies on a notion of subjunctive dependence, where multiple implementations of the same mathematical function are considered (even counterfactually) to have identical results for logical rather than causal reasons. Taking these subjunctive dependencies into account allows FDT agents to outperform CDT and EDT agents in, e.g., the presence of accurate predictors. While not necessary for considering classic decision theory problems, we note that a full specification of FDT will require a non-trivial theory of logical counterfactuals and algorithmic similarity.

Act Consequentialism without Free Rides +

(with Preston Greene)

Philosophical Perspectives (2020)

Consequentialist theories determine rightness solely based on real or expected consequences. Although such theories are popular, they often have difficulty with generalizing intuitions, which demand concern for questions like “What if everybody did that?” Rule consequentialism attempts to incorporate these intuitions by shifting the locus of evaluation from the consequences of acts to those of rules. However, detailed rule‐consequentialist theories seem ad hoc or arbitrary compared to act consequentialist ones. We claim that generalizing can be better incorporated into consequentialism by keeping the locus of evaluation on acts but adjusting the decision theory behind act selection. Specifically, we should adjust which types of dependencies the theory takes to be decision‐relevant. Using this strategy, we formulate a new theory, generalized act consequentialism, which we argue is more compelling than rule consequentialism both in modeling the actual reasoning of generalizers and in delivering correct verdicts.

An Objection of Varying Importance to Epistemic Utility Theory +

Philosophical Studies (2019)

Some propositions are more epistemically important than others. Further, how important a proposition is is often a contingent matter—some propositions count more in some worlds than in others. Epistemic Utility Theory cannot accommodate this fact, at least not in any standard way. For EUT to be successful, legitimate measures of epistemic utility must be proper, i.e., every probability function must assign itself maximum expected utility. Once we vary the importance of propositions across worlds, however, normal measures of epistemic utility become improper. I argue there isn’t any good way out for EUT.

Imprecise Epistemic Values and Imprecise Credences +

Australasian Journal of Philosophy (2019)

A number of recent arguments purport to show that imprecise credences are incompatible with accuracy-first epistemology. If correct, this conclusion suggests a conflict between evidential and alethic epistemic norms. In the first part of the paper, I claim that these arguments fail if we understand imprecise credences as indeterminate credences. In the second part, I explore why agents with entirely alethic epistemic values can end up in an indeterminate credal state. Following William James, I argue that there are many distinct alethic values that a rational agent can have. Furthermore, such an agent is rationally permitted not to have settled on one fully precise value function. This indeterminacy in value will sometimes result in indeterminacy in epistemic behaviour—that is, because the agent’s values aren’t settled, what she believes might not be.

The Foundations of Epistemic Decision Theory +

(with Jason Konek)

Mind (2019)

According to accuracy-first epistemology, accuracy is the fundamental epistemic good. Epistemic norms–Probabilism, Conditionalization, the Principal Principle, and so on–have their binding force in virtue of helping to secure this good. To make this idea precise, accuracy-firsters invoke Epistemic Decision Theory (EPDT) to determine which epistemic policies are the best means toward the end of accuracy. Hilary Greaves and others have recently challenged the tenability of this programme. Their arguments purport to show that EPDT encourages obviously epistemically irrational behaviour. We develop firmer conceptual foundations for EPDT. First, we detail a theory of praxic and epistemic good. Then we show that, in light of their very different good-making features, EPDT will evaluate epistemic states and epistemic acts according to different criteria. So, in general, rational preference over states and acts won’t agree. Finally, we argue that based on direction-of-fit considerations, it is preferences over the former that matter for normative epistemology, and that EPDT, properly spelt out, arrives at the correct verdicts in a range of putative problem cases.

A Pragmatist's Guide to Epistemic Utility +

Philosophy of Science (2017)

We use a theorem from M. J. Schervish to explore the relationship between accuracy and practical success. If an agent is pragmatically rational, she will quantify the expected loss of her credence with a strictly proper scoring rule. Which scoring rule is right for her will depend on the sorts of decisions she expects to face. We relate this pragmatic conception of inaccuracy to the purely epistemic one popular among epistemic utility theorists.

Accuracy Uncomposed: Against Calibrationism +

Episteme (2017)

Pettigrew offers new axiomatic constraints on legitimate measures of inaccuracy. His axiom called ‘Decomposition’ stipulates that legitimate measures of inaccuracy evaluate a credence function in part based on its level of calibration at a world. I argue that if calibration is valuable, as Pettigrew claims, then this fact is an explanandum for accuracy-first epistemologists, not an explanans, for three reasons. First, the intuitive case for the importance of calibration isn't as strong as Pettigrew believes. Second, calibration is a perniciously global property that both contravenes Pettigrew's own views about the nature of credence functions themselves and undercuts the achievements and ambitions of accuracy-first epistemology. Finally, Decomposition introduces a new kind of value compatible with but separate from accuracy-proper in violation of Pettigrew's alethic monism.

Permissive Rationality and Sensitivity +

Philosophy and Phenomenological Research (2017)

Permissivism about rationality is the view that there is sometimes more than one rational response to a given body of evidence. In this paper I discuss the relationship between permissivism, deference to rationality, and peer disagreement. I begin by arguing that—contrary to popular opinion—permissivism supports at least a moderate version of conciliationism. I then formulate a worry for permissivism. I show that, given a plausible principle of rational deference, permissive rationality seems to become unstable and to collapse into unique rationality. I conclude with a formulation of a way out of this problem on behalf of the permissivist.

With All Due Respect: The Macro-Epistemology of Disagreement +

Philosophers' Imprint (2015)

In this paper, I develop a new kind of conciliatory answer to the problem of peer disagreement. Instead of trying to guide an agent’s updating behaviour in any particular disagreement, I establish constraints on an agent’s expected behaviour and argue that, in the long run, she should tend to be conciliatory toward her peers. I first claim that this macro-approach affords us new conceptual insight on the problem of peer disagreement and provides an important angle complementary to the standard micro-approaches in the literature. I then detail the import of two novel results based on accuracy-considerations that establish the following: An agent should, on average, give her peers equal weight. However, if the agent takes herself and her advisor to be reliable, she should usually give the party with a stronger opinion more weight. In other words, an agent’s response to peer disagreement should over the course of many disagreements average out to equal weight, but in any particular disagreement, her response should tend to deviate from equal weight in a way that systematically depends on the actual credences she and her advisor report.

Leitgeb and Pettigrew on Accuracy and Updating +

Philosophy of Science (2012)

Leitgeb and Pettigrew argue that (1) agents should minimize the expected inaccuracy of their beliefs and (2) inaccuracy should be measured via the Brier score. They show that in certain diachronic cases, these claims require an alternative to Jeffrey Conditionalization. I claim that this alternative is an irrational updating procedure and that the Brier score, and quadratic scoring rules generally, should be rejected as legitimate measures of inaccuracy.

Facts, Interpretation, and Truth in Fiction +

The British Journal of Aesthetics (2007)

Drafts & Preprints

Does ChatGPT Have a Mind? +

(with Simon Goldstein)

This paper examines the question of whether Large Language Models (LLMs) like ChatGPT possess minds, focusing specifically on whether they have a genuine folk psychology encompassing beliefs, desires, and intentions. We approach this question by investigating two key aspects: internal representations and dispositions to act. First, we survey various philosophical theories of representation, including informational, causal, structural, and teleosemantic accounts, arguing that LLMs satisfy key conditions proposed by each. We draw on recent interpretability research in machine learning to support these claims. Second, we explore whether LLMs exhibit robust dispositions to perform actions, a necessary component of folk psychology. We consider two prominent philosophical traditions, interpretationism and representationalism, to assess LLM action dispositions. While we find evidence suggesting LLMs may satisfy some criteria for having a mind, particularly in game-theoretic environments, we conclude that the data remains inconclusive. Additionally, we reply to several skeptical challenges to LLM folk psychology, including issues of sensory grounding, the "stochastic parrots" argument, and concerns about memorization. Our paper has three main upshots. First, LLMs do have robust internal representations. Second, there is an open question to answer about whether LLMs have robust action dispositions. Third, existing skeptical challenges to LLM representation do not survive philosophical scrutiny.

Robustness and Resource Control: Unpacking Power-Seeking in AI +

(with Bruce Rushing)

The goal in this essay is to make progress on plausibility of instrumental convergence of power-seeking and how it relates to the actual objectives advanced AGIs will pursue.

Low Impact Artificial Intelligences +

(with Stuart Armstrong)

There are many goals for an AI that could become dangerous if the AI becomes superintelligent or otherwise powerful. Much work on the AI control problem has been focused on constructing AI goals that are safe even for such AIs. This paper looks at an alternative approach: defining a general concept of `low impact'. The aim is to ensure that a powerful AI which implements low impact will not modify the world extensively, even if it is given a simple or dangerous goal. The paper proposes various ways of defining and grounding low impact, and discusses methods for ensuring that the AI can still be allowed to have a (desired) impact despite the restriction. The end of the paper addresses known issues with this approach and avenues for future research.