In November, I stood in front of a group of academics, anxious about what would happen. I was presenting a paper at the cognitive load theory conference—a paper I have so far failed to get published and so have posted to the preprint server, PsyArXiv. I was concerned I might be met with hostility and even mild ridicule. Thankfully, I was not. At the end of my presentation, the questions appeared to be driven by curiosity more than criticism, and a number of people told me they enjoyed my presentational style—I had taken heed of cognitive load theory’s redundancy effect and avoided the common practice of presenting information on a slide while simultaneously paraphrasing it. So although I don’t think I am any closer to publication, it was a qualified success.
One of my concerns was, and to some extent still is, that I have taken a concept that originated in physics—entropy—and was later applied to information theory by Claude Shannon, and applied it for the first time to cognitive load theory. That doesn’t sound too bad, but you need to be aware that entropy has gathered around it a certain amount of pseudo-mystical woo. The second law of thermodynamics—often paraphrased as ‘entropy increases’—is the only law of physics that is asymmetrical in time. In other words, if you film an event and then run the film backwards, this is the only law of physics that is violated by the reversed film. This has led some science popularisers to stroke their chins, stare into the middle distance and declare something superficially profound about the ‘arrow of time.’ I have fallen for this myself in the past and there is no doubt this is a reason why the concept is salient to me—salient enough for it to occur to me to apply it to cognitive load theory.
Entropy arose as a result of physicists trying to understand ‘heat engines’ such as the steam engines that were being built and developed during the industrial revolution. There was a problem that needed explaining. A key law of physics is that energy cannot be created or destroyed, only turned from one form into another. When a steam engine burns coal, the heat from that coal doesn’t disappear; it spreads out into the environment. However, once it has spread out into the environment, we can no longer use it to do useful things like power pumps or turn wheels. So, there is something more we need, beyond the concept of energy, to explain what’s going on. Entropy is a measure of this usability. We say that the heat energy in the burning coal is lower in entropy than the heat energy that has spread into the surroundings.
What is the physical difference? People have tried to explain this in a number of ways. The most popular, but perhaps misleading, is to talk about order. The burning coal is more ordered because the heat is concentrated in only one place, whereas once it has spread out into the surroundings, it is disordered. In this telling, low entropy means high order.
Imagine introducing a drop of ink to a tank of water. Initially, the ink is in a highly ordered state, distinct from the water. Once it hits the water, it spreads out and becomes less ordered. It is possible for the ink to form itself back into a drop again spontaneously, but we would have to wait a very long time—many times the age of the universe—for that to happen.
Which leads to the question: where does all the order we see around us come from? The answer is that local order can only increase if there is an increase in disorder elsewhere that more than compensates for it. In our case, that compensating increase comes from the heat generated in the heart of the Sun spreading out into the wider universe.
We are getting carried away again.
It was Ludwig Boltzmann who came up with the mathematical model that the idea of disorder is based on. In mathematical terms, he stated that entropy is related to the number of possible states of a system and the probability of each state. By taking logarithms—don’t ask—and using a suitable constant, he was able to show that this computed the same values for entropy as the older heat-engine approach. In my view, it is better to hold onto this idea of entropy—that it represents the number of possible states of a system—than the vaguer and more confusing notion of disorder.
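For the curious, the version of Boltzmann’s formula for a system whose W possible states are all equally likely is usually written as

$$ S = k \ln W $$

where k is the suitable constant mentioned above, now known as Boltzmann’s constant. The more states available to the system, the higher its entropy.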
In 1948, in a seminal paper, Claude Shannon applied this model to information. No longer tethered to energy, informational entropy was a measure of the number of possibilities or states that the information allows. This is where the London weather comes in.
If we have no information about the weather in London right now, then it could be raining or it could not be raining. If we are given the information that it is raining, then we have reduced the number of possible states from two to one. We have reduced the entropy.
If we have information that makes one state more probable than the other, we have also reduced the entropy of the system because the probability of each state is factored into the calculation.
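Shannon’s formula for this calculation is usually written as

$$ H = -K \sum_{i=1}^{n} p_i \log p_i $$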
You don’t have to understand the equation to follow the argument but, for the mathematically inclined, H is entropy, K is an arbitrary constant, n is the number of possible states and p_i is the probability of each state, i. Like the constant, the base of the logarithm used is arbitrary. However, since we are talking about information, which is often stored in computers as binary strings of 0s and 1s, we may choose to take logs to base 2.
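To make the London example concrete, here is a minimal sketch in Python (my own illustration rather than anything from the paper), using logs to base 2 so the answers come out in bits: two equally likely states give one bit of entropy, a lopsided forecast gives less, and a single certain state gives zero.

```python
import math

def entropy(probabilities, base=2):
    """Shannon entropy with K = 1: H = -sum(p * log p) over the possible states."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

# No information about London: raining and not raining equally likely
print(entropy([0.5, 0.5]))   # 1.0 bit of uncertainty

# A forecast that makes rain much more likely than not
print(entropy([0.9, 0.1]))   # roughly 0.47 bits

# Told it is definitely raining: only one possible state remains
print(entropy([1.0]))        # zero entropy
```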
Have you noticed that communicating information—it is raining in London—reduces entropy and that teaching is a process of communicating information? Therefore, I wondered, perhaps we can think of teaching as a process of reducing entropy.
This fits well with cognitive load theory because it suggests that information held in long-term memory is at zero entropy: it does not exist as a series of possible states. We just know that 7 x 8 = 56, or a potted timeline of the Second World War, or the Islamic view of Jesus. Yes, there are things we don’t know—there may be huge gaps—but the things we do know exist in this perfect state. We can draw on them instantly and without effort, perhaps as a result of this. By analogy with heat engines, we may describe this knowledge as considerably more usable.
In contrast, information in the environment will always carry some uncertainty about how it can be arranged. This can easily overwhelm working memory, so teachers, when introducing new content, need to present it in a low enough entropy state to sit within working memory’s limits.
The idea implies that whatever happens in the process of teaching, and however we teach, learning involves reducing external sources of information to a low or zero entropy state in long-term memory. There are nuances to this, perhaps. Connecting a stimulus in the environment to the correct schema in long-term memory requires pulling the information back the other way during teaching—something we have come to call retrieval practice—and this doesn’t always look like the process of making content ever simpler. Indeed, part of the process could involve the uncovering and resolving of remaining uncertainties so that schema can be extended to these circumstances, too. Is this what is supposedly germane about germane load? I’m not sure.
Clearly, I think this idea has potential. I’m not sure what potential exactly and that is, perhaps, the problem. What does this add that is not already there in cognitive load theory? Is it more fundamental? Does it solve any problems in the theory or simplify it? I am not sure yet. But I am going to continue to explore it.
I think information theory is a fascinating way to look at learning. One issue is that the learning situation is a lot more complicated than Shannon’s analysis of a sender, a receiver of messages and a message channel.
You have at least the idea of a sender (the teacher) and two other endpoints, long-term memory and short-term memory, with channels between these in each direction.
You could simplify this by just looking at two channels for messages: one from short-term memory to long-term memory and one from long-term memory back to short-term memory.
A practical application of Shannon’s theory is the study of how message alphabets, and constraints on them, can be optimally used to improve communication over an imperfect channel. Steven Pinker touches on how this works in spoken language with regular and irregular verbs.
For example, “dived” uses the root “dive” and a common ending, “-ed”, which is constrained to always mean the past tense. The North American “dove” doesn’t have this structure and requires us to know what is meant without the clue of the close match to “dive” and the common “-ed” ending. According to Pinker, we use irregular verbs optimally to make communication better, and this works by having common verbs irregular and less common ones regular.
That is in line with Shannon’s theories: a less common word benefits from the regular structure, which reduces the decoding problem, while a common word benefits from being shorter or more distinctive and, being well known, is more likely to be matched correctly to the sounds we hear.
In language we are matching sounds to meanings, and efficiency is about using the fewest sounds that still give us a good chance of not making a mistake.
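As a rough sketch of that trade-off (the frequencies below are invented for illustration, not Pinker’s data), Shannon’s theory puts the ideal code length for a symbol at roughly minus the log of its probability, so frequent words can afford short, idiosyncratic forms while rarer words benefit from predictable, redundant structure.

```python
import math

# Invented relative frequencies, for illustration only - not real corpus data
word_freq = {
    "go": 0.05,               # very common, irregular past tense "went"
    "take": 0.02,             # common, irregular "took"
    "dive": 0.0005,           # rarer, regular past tense "dived"
    "perambulate": 0.000001,  # rare, regular "perambulated"
}

for word, p in word_freq.items():
    ideal_bits = -math.log2(p)  # Shannon's ideal code length in bits
    print(f"{word:>12}: p = {p:.6f}, ideal code length ~ {ideal_bits:.1f} bits")
```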
The example in your paper of 3x = 18 is interesting. When people first learn simple algebra like this, they are combining two languages into a new one - the language of arithmetic and the language of letters. This is useful once we know what is going on, because we are already good at recognizing all the symbols when spoken or written. However, viewed as an information-passing problem, we might see a brain that knows spelling and arithmetic struggling to match algebra to either one of them.
An interesting experiment would be to see if introducing algebra using a Greek letter works better - sometimes this type of thing is done using an open square for the unknown, so 3 x [ ] = 18.
(You can see another issue going on with the use of x as a variable and as the symbol for multiplication).
A Greek letter might work better, as it is easier to say “alpha” than “what goes in the square”, and it introduces the idea of labels for variables without the close proximity to other messages, such as words with x in them or x as multiplication.
You could also experiment, taking a hint from programming, with using a full word for a variable that represents a physical thing, and only introduce the abstraction of a generic label once the process of solving for the variable is no longer new information.
Mathematics is a language which sheds redundancy, and that works because it relies on a high degree of familiarity with the language. It becomes a great language for those who know it, but it works against those who don’t.
Looking at it this way suggests that introducing the efficiency slowly will work better for learning. Information theory provides a way to decide whether you are doing that.
However, to use information theory you have to know the information content of the language used - how much is new, and how much sits in close proximity to different concepts and needs more redundancy to separate it.
A key point from Shannon’s theory is that the information content of the message depends on the receiver’s existing knowledge of the message language and message content. In learners this is dynamic.
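One way to picture that dependence (again a toy sketch of my own, not a model of real learners): the information a message carries for a particular receiver is minus the log of the probability that receiver already assigned to it, so the same statement is worth fewer bits to someone who has largely worked it out already.

```python
import math

def surprisal(prior_probability):
    """Bits of information a message carries for a receiver who gave it this prior probability."""
    return -math.log2(prior_probability)

# The same message, "it is raining in London", for three different receivers:
print(surprisal(0.5))    # 1.0 bit - no idea either way
print(surprisal(0.95))   # about 0.07 bits - already nearly certain
print(surprisal(0.05))   # about 4.3 bits - expected the opposite
```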
(You might think of the natural reaction to a boring lesson on something we already know as our minds’ determination to apply Shannon’s theory well.)