Lit Review of Planning in LLMs
I looked at some papers that try to do interpretability of planning in LLMs.
I previously looked at some ways you can break down the idea of planning. These were:
Time horizon: how far ahead does the model know things?
Vague vs Specific: how precisely does the model know things about the future?
Option Space: how much is the model considering different possibilities?
Dependency Steps: how many consecutive constraints are there?
Implicit vs Explicit: is the model choosing things by inferring forward from previous observations, or by reasoning back from a future end-goal?
Externalized vs Internalized: is the plan kept hidden in its activations, or has the model written the plan down and is following it?
Consistency: does the model decide the whole plan once, or is it re-evaluated step-by-step?
I look at some clusters in the literature that try to investigate some of these questions. I focus here only on internalized planning, not so much on externalized planning.
Examples in literature
I will briefly go over papers I’ve managed to read and understand, categorized this way. There are some follow-up papers and papers I still have to read; I may or may not update this post with those in the future.
Theoretical: pre-caching vs breadcrumbs, predicting vs acting
Looking at Timescale: future lens, parascopes
Testing explicitness: poems, hidden goals
Forward Dependency: chess, blocksworld, sokoban
Methodological: Activation Addition, PatchScopes, LatentQA, Activation Oracles
I list the papers I didn’t get to but seem relevant in the footnotes.[1]
Papers that test Explicitness/Consistency
One good paper is Anthropic’s Biology of a Large Language Model paper from 2025.[2] They test a couple of things I will describe here:
Planning in Poems
One main thing is that they specifically test an aspect of dependency/explicitness in planning. They find that, in the activations at the newline token at the end of a line of a poem, the model already has some explicit idea of the word the next line should end with, such that it can make the poem rhyme.
They use a variant of SAEs (cross-layer transcoders) to identify which words the model is considering at that point, then do causal interventions to suppress the planned word, and see that the model instead uses one of the other words it was considering.
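As a concrete illustration of the shape of such an intervention (my sketch, not Anthropic’s actual setup: they use cross-layer transcoder features on Claude, while this uses TransformerLens on GPT-2 with a random stand-in direction and an arbitrary layer):

```python
# Sketch of a "suppress the planned word" intervention. The
# rhyme_direction is a random stand-in for a learned feature
# direction (e.g. from an SAE/transcoder); layer 8 is arbitrary.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("He saw a carrot and had to grab it,\n")
newline_pos = tokens.shape[1] - 1

rhyme_direction = torch.randn(model.cfg.d_model)
rhyme_direction = rhyme_direction / rhyme_direction.norm()

def ablate_direction(resid, hook):
    # Project the stand-in feature direction out of the residual
    # stream at the newline token, where the planned word would live.
    coeff = resid[:, newline_pos] @ rhyme_direction
    resid[:, newline_pos] -= coeff[:, None] * rhyme_direction
    return resid

# KV cache disabled so the full sequence (and the patch) is re-run
# on every generation step.
with model.hooks(fwd_hooks=[("blocks.8.hook_resid_post", ablate_direction)]):
    out = model.generate(tokens, max_new_tokens=12, use_past_kv_cache=False)
print(model.to_string(out))
```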
This seems pretty good evidence of quite explicit planning with some dependency steps, on the time-horizon of one line. The model also seems to commit and be consistent about the line and the word once it’s started.
Hidden Goals
In auditing for hidden goals[3], they find that training the model on descriptions of different goals does cause it to behave differently conditional on those goals. This mostly shows that models can explicitly condition their outputs on a goal, even when trained not to externalize it. The behavior is also consistent on the time-scale of a whole question, though they mostly test things that seem relatively shallow in terms of dependency steps.
There are a few other papers which also investigate hidden goals, but I consider these a bit out-of-scope for this review.
More Theoretical Papers
Pre-caching vs breadcrumbs
This 2024 paper[4] decomposes information processing that is useful for later prediction into two types. Either the information is immediately useful for the current prediction and happens to also be useful for predictions down the line (breadcrumbs), or the information is not immediately useful and instead takes up thinking space, hurting predictions now but on net helping predictions later down the line (pre-caching).
I would say this work is mostly trying to operationalize whether the model’s apparent planning is more implicit happenstance, or whether it is more explicit. The description of an explicit tradeoff is useful.
They test small language models, and do find a timescale of at least a few tokens, but only test a shallow and non-specific forward dependency. They also test a toy task with specific information, and do find some more specific mathematical functions being pre-cached there.
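My rough restatement of how they operationalize this (notation mine): the gradient of the total loss decomposes into terms where computation at position $\tau$ serves its own position’s prediction versus later positions’,

$$
\nabla_\theta \, \mathcal{L} \;=\; \sum_{t}\; \sum_{\tau \le t} \frac{\partial \mathcal{L}_t}{\partial h_\tau}\,\frac{\partial h_\tau}{\partial \theta}
$$

where $h_\tau$ is the hidden state at position $\tau$ and $\mathcal{L}_t$ is the loss at position $t$. Their “myopic” training scheme keeps only the $\tau = t$ terms, so the model is never rewarded for computing something purely for the benefit of later positions. Any performance gap between the normal and myopic models is then evidence of pre-caching, while foresight that survives myopic training is just breadcrumbs.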
Acting vs predicting tradeoffs
In this 2024 paper, they discuss the somewhat well-known phenomenon of “mode collapse” when doing RL on models. Training on goals means that, in order to act, the model needs some idea that it will follow a consistent policy in the future. There is thus a tradeoff: a smaller option space buys a longer time horizon that is more specific and less fuzzy, though this is more useful for implicit rather than explicit planning.
They don’t test time-scale at all. They do describe how this means LLMs post-RL probably have, at least implicitly, more consistency in planning and more specific ideas of what the future looks like.
Papers that test Time-Scale/Consistency
Future Lens
In this older, more foundational 2023 paper[5], they tried different ways of probing the hidden states to predict tokens several steps ahead.
They found they could sometimes do this for a few tokens ahead, at an accuracy above bigram statistics. I would say it is testing a time horizon of a few tokens, and shows that sometimes there are very specific plans, but it doesn’t show much about the other aspects.
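The simplest version of the recipe is a linear probe from the hidden state at position t to the token at position t+k. This is my illustration of that general recipe with dummy tensors, not their exact architecture (they also test richer decoders than a plain linear map):

```python
# Minimal sketch: train a linear probe to predict the token k steps
# ahead from a hidden state cached from some layer of an LM.
import torch
import torch.nn as nn

d_model, vocab_size, k = 768, 50257, 2

probe = nn.Linear(d_model, vocab_size)
opt = torch.optim.Adam(probe.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(hidden_states, token_ids):
    # hidden_states: [batch, seq, d_model] activations from the LM
    # token_ids:     [batch, seq] the actual token sequence
    h = hidden_states[:, :-k]   # states at positions t
    targets = token_ids[:, k:]  # tokens at positions t + k
    logits = probe(h)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Dummy batch standing in for cached LM activations:
loss = train_step(torch.randn(4, 32, d_model),
                  torch.randint(0, vocab_size, (4, 32)))
```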
ParaScopes
My work from 2024/2025 mostly looks again at time horizon, showing moderate evidence for paragraph-scale planning, but not really for longer document-level planning. The model seemed to be planning around ~5-10 tokens ahead, when compared against how much it could simply infer, with a medium amount of explicitness.
There is some degree to which the planning seems non-consistent: we tested how the model’s plans change as it goes from one paragraph to another, and it has much more information at the newline just before the new paragraph, much more so than even one paragraph earlier.
There is not much testing of dependency steps or implicit vs explicit. I would like to do more causal experiments here some time.
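Roughly, the probe setup looks like this (a simplified sketch: in the actual work the target embeddings come from a pretrained sentence autoencoder such as SONAR, so a predicted embedding can be decoded back into text; here the embeddings are just placeholder tensors):

```python
# Simplified sketch of a ParaScopes-style probe: regress from the
# residual stream at a newline token to an embedding of the paragraph
# that follows. The embedding inputs here are stand-ins.
import torch
import torch.nn as nn

d_model, d_embed = 4096, 1024
probe = nn.Linear(d_model, d_embed)
opt = torch.optim.Adam(probe.parameters(), lr=1e-4)

def train_step(newline_resid, next_para_embedding):
    # newline_resid:       [batch, d_model] residual stream at "\n\n"
    # next_para_embedding: [batch, d_embed] embedding of next paragraph
    pred = probe(newline_resid)
    loss = nn.functional.mse_loss(pred, next_para_embedding)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```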
Methodological Papers
Most of these papers are relevant, but aren’t explicitly testing planning. They are more so tools that would potentially be useful for this type of testing.
Activation Addition Engineering
In this work from 2023, they find that you can take activations from the model run on one prompt, and steer what it does on a different prompt by linearly adding in those activations from the first context.
This mostly just shows that the option space is broad, and provides tools for how one can test for planning.
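A minimal version of the recipe (my sketch with TransformerLens on GPT-2; the contrast pair, layer, and coefficient follow the spirit of the paper’s “Love − Hate” example, but the exact values are arbitrary, and I add the vector at every position for simplicity):

```python
# Minimal ActAdd-style steering sketch with TransformerLens.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_pre"

# 1. Record activations for a contrast pair of prompts.
_, cache_pos = model.run_with_cache("Love")
_, cache_neg = model.run_with_cache("Hate")
steer = cache_pos[hook_name][0, -1] - cache_neg[hook_name][0, -1]

# 2. Add the difference vector into a new forward pass.
def add_steer(resid, hook, coeff=8.0):
    resid[:, :] += coeff * steer  # every position, for simplicity
    return resid

with model.hooks(fwd_hooks=[(hook_name, add_steer)]):
    out = model.generate(model.to_tokens("I think you are"),
                         max_new_tokens=15)
print(model.to_string(out))
```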
Patchscopes
In this 2024 work[6], they describe some principles for using models to decode their own activations, and mostly focus on probing what the model is thinking about the next token. This is something like a generalization of ActAdd and Future Lens.
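The basic move, sketched (the few-shot identity prompt mirrors the paper’s token-identity example; the model, layer, and placeholder choices are mine):

```python
# Sketch of a simple Patchscope: take a hidden state from a source
# prompt and patch it into an inspection prompt that asks the model
# to repeat/describe it.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_post"

# Hidden state to inspect: last position of the source prompt.
_, cache = model.run_with_cache("The Eiffel Tower is in")
source_h = cache[hook_name][0, -1]

# Inspection prompt with a placeholder token ("x") to overwrite.
target = model.to_tokens("cat -> cat; 135 -> 135; hello -> hello; x")
patch_pos = target.shape[1] - 1

def patch(resid, hook):
    # Overwrite the placeholder position with the source hidden state.
    # (KV cache is disabled below, so each step re-runs the full prompt.)
    resid[:, patch_pos] = source_h
    return resid

with model.hooks(fwd_hooks=[(hook_name, patch)]):
    out = model.generate(target, max_new_tokens=5, use_past_kv_cache=False)
print(model.to_string(out))
```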
Meta models for decoding information
In LatentQA[7], they use an additional model to decode what the target model is thinking, on the horizon of next-token to next-phrase. They try to find the more implicit type of planning, like “what persona is this?”, rather than trying to find plans directly.
In Activation Oracles[8], they also use a fine-tuned model to try to decode what activations mean, by answering natural-language questions about them. This seems potentially useful to apply in the future to understand planning too, though I have some concerns about the probes learning too much here as well.
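My loose sketch of the mechanism shared by both papers: splice stored activations into a reader model’s forward pass at placeholder positions, then ask it a question. The real systems fine-tune the reader on paired activation/QA data, which this omits (so an off-the-shelf model would answer gibberish); everything specific here is a stand-in:

```python
# Loose sketch of the LatentQA / Activation Oracle pattern: splice
# stored activations from a target model into a reader model's
# forward pass, then ask the reader a natural-language question.
import torch
from transformer_lens import HookedTransformer

reader = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.2.hook_resid_post"

stored_acts = torch.randn(3, reader.cfg.d_model)  # stand-in activations

prompt = "? ? ? Question: what is the model planning to write? Answer:"
tokens = reader.to_tokens(prompt)

def splice(resid, hook):
    # Overwrite the "?" placeholder positions with stored activations
    # (offset by 1 to skip the BOS token TransformerLens prepends).
    resid[:, 1:1 + stored_acts.shape[0]] = stored_acts
    return resid

with reader.hooks(fwd_hooks=[(hook_name, splice)]):
    out = reader.generate(tokens, max_new_tokens=10,
                          use_past_kv_cache=False)
print(reader.to_string(out))
```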
Search Problems
There are a few papers looking specifically at decision-tree games, to see forward dependency and implicit vs explicit planning. There is some work along this line labelled “Searching for Search”.
Sokoban
One paper looks at the RL game Sokoban[9] and finds that performance increases when one gives the model some free time at the start to make random moves. This somewhat shows a level of explicitness and forward-dependency planning happening in that early process, more so than simply implicit planning.
Chess Models
One paper looks at chess[10], and finds some explicit look-ahead: representations of specific later moves do seem to be important to the planning process of that chess model.
In a follow-up paper[11], someone else continues to look at this, trying to look more moves ahead (I haven’t had time to fully read this paper though).
These are good settings for looking at things that are more explicit and have many dependency steps.
Blocksworld
One paper looks at LLMs in games[12] and mostly finds that models trained on a specific game can make some good moves sometimes, but I’m not sure how much this generalizes to normal language models. (In general, I prefer that if it’s a toy task, it may as well be done with a toy model.)
What work is still missing?
I think there is still a lot missing in actually being able to decode how language models might or might not be doing different kinds of planning. I do think the axes I use to analyze these papers help me see more clearly how the different papers are asking different questions. I think Activation Oracles could be one method for trying to probe planning, and more work similar to what Anthropic is doing could help too. But I don’t think there is much understanding of what models are doing overall, nor a specific overarching view of how LLMs are planning (though there are a few papers I still need to catch up on).
This work was day 15/30 of daily posting at Inkhaven. There may be some information I missed, and some things may not be as polished as I would like. Some of the papers I had only read a while ago.
Footnotes
1. Here are papers I didn’t get to read but seem relevant too:
Detecting and Characterizing Planning in Language Models
Internal Planning in Language Models: Characterizing Horizon and Branch Awareness
Emergent Response Planning in LLMs
Transformers Can Navigate Mazes With Multi-Step Prediction
Latent Planning Emerges with Scale
Interpreting Emergent Planning in Model-Free RL
Thinking Models
2. On the Biology of a Large Language Model (Lindsey et al. 2025)
3. Auditing Language Models for Hidden Objectives (Marks et al. 2025)
4. Do Language Models Plan for Future Tokens? (Wu et al. 2024)
5. Future Lens: Anticipating Subsequent Tokens from a Single Hidden State (Pal et al. 2023)
6. Patchscopes: A Unifying Framework for Inspecting Hidden Representations (Ghandeharioun et al. 2024)
7. LatentQA: Teaching LLMs to Decode Activations Into Natural Language (Pan et al. 2024)
8. Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers (Karvonen 2025)
9. Planning in a Recurrent Neural Network that Plays Sokoban (Taufeeque et al. 2024)
10. Evidence of Learned Look-ahead in a Chess-Playing Neural Network (Jenner et al. 2024)
11. Understanding the Learned Look-ahead Behavior of Chess Neural Networks (Cruz 2025)
12. Unlocking the Future: Exploring Look-ahead Planning Mechanistic Interpretability in Large Language Models (Men et al. 2024)


