Metacognition: Our brain is a statistical machine, and understanding how it works may help us use it better.

Next generation AI
Thu, 28 Sep 2017 08:45:12 +0000

The post Next generation AI appeared first on Metacognition.

Summary: By introducing internal reward systems, AI networks may become faster learners and may even acquire moral standards. Recent scientific results indicate a shift in AI. Are we seeing a new generation of AI?

Artificial intelligence, or AI for short, is all around us. Google, Microsoft, IBM, Tesla and Facebook have all been doing it for a long time, and this is just the start. The Russian president Vladimir Putin recently said that those at the forefront of AI will rule the world, whereas others, like Elon Musk and Bill Gates, raise concerns about the dangers of AI and the creation of superhuman intelligence.

Where are we heading? Are we in charge, or is the process already beyond control?

One thing is for sure: the speed of AI development has skyrocketed since the first attempts were made, several decades ago, to mimic human brain learning with simple artificial neural networks. Much of the theory behind neural networks has not changed since then, although some algorithmic improvements have come about, like reinforcement learning and convolutional networks. The real breakthrough of AI in recent years, however, has been made possible by brute force, through big data and increased computational power.

Still, artificial learning algorithms are not yet as efficient as the natural learning processes of the human brain. Humans are in some respects much more efficient learners than computers, although computers may digest much larger quantities of data per unit of time. We can extract the essence of information (that is, learn) from only a few repeated examples, whereas a computer may need thousands of input examples. In some circumstances we may in fact need only a single experience to learn about, for instance, a life-threatening danger.

There is no question that the learning algorithms used in AI are computationally heavy and quite inefficient. The AI pioneer Geoffrey Hinton recently said he is “deeply suspicious” of the back-propagation step involved in the training of neural networks, and he calls for a new path to AI. New inspiration is needed to make the algorithms more efficient, and what is more natural than to turn to the natural neural networks of our own brains for that inspiration?

But faster and more efficient learning does not calm the nerves of those who fear superhuman intelligence; on the contrary! How can we become more confident that artificial intelligence will behave according to the moral standards modern, developed societies live by? Here, too, we should turn to our own brains for inspiration, because after all, humans are capable of thinking and behaving morally, even if the news is filled with counterexamples every day. We may still hope that it will be possible to create AI with superhuman moral standards as well as superhuman intelligence!

Geoffrey Hinton is probably right: we need a new path to AI. We need a next generation of AI!


The next generation of AI must learn more efficiently and be more human-like in the way it acts according to values and ethical standards set by us.

Three fairly recent scientific findings in AI research and neuroscience may together reveal how next generation AI must be developed.

  • The first important result is found within the theory of “information bottlenecks” for deep learning networks, by Naftali Tishby and co-workers at the Hebrew University of Jerusalem.
  • The second result is the new curiosity-driven learning algorithm developed by Pulkit Agrawal and co-workers at the Berkeley Artificial Intelligence Research Lab.
  • And finally, a fresh-off-the-press paper by John Henderson and colleagues at the Center for Mind and Brain at the University of California, Davis, shows how visual attention is guided by internal and subjective evaluation of meaning.

These three results all point, directly or indirectly, to a missing dimension in today’s AI, namely a top-down control system in which higher levels of abstraction actively influence how input signals are filtered and perceived. Today’s AI algorithms are predominantly bottom-up in the way input signals are trained from the bottom up to learn given output categories. The back-propagation step of deep learning networks is in that sense not a top-down control system, since the adjustments of the weights in the network have only one main purpose: to maximize an extrinsic reward function. By extrinsic we mean the outer, cumulative reward that the system has been set to maximize during learning.

The fundamental change in AI must come with the shift to applying intrinsic as well as extrinsic reward functions in AI.

Let’s begin with the information bottleneck theory to shed light on this postulate. In a YouTube video, Naftali Tishby explains how the information bottleneck theory reveals previously hidden properties of deep learning networks. Deep neural networks have been considered “black boxes” in the sense that they are self-learning and difficult to understand from the outside, but the new theory and experiments reveal that learning in a deep network typically has two phases.

  • First there is a learning phase, where the first layers of the network try to encode virtually everything about the input data, including irrelevant noise and spurious correlations.
  • Then there is a compression phase, as deep learning kicks in, where the deeper layers start to compress information into (approximately minimal) sufficient statistics that are as optimal as possible with regard to prediction accuracy for the output categories.

The latter phase may also be considered as a forgetting phase where irrelevant variation is forgotten, retaining only representative and relevant “archetypes” (as Carl Jung would have referred to them).
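
The two phases can be illustrated with a toy mutual-information computation. This is my own sketch, not Tishby’s actual estimator, and the data and variable names are invented for illustration: a representation T that encodes everything about the input X scores high on I(X;T), while a minimal sufficient statistic keeps all of I(T;Y) but compresses I(X;T).

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information (in bits) estimated from a list of (a, b) samples."""
    n = len(pairs)
    pab = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2(c * n / (pa[a] * pb[b])) for (a, b), c in pab.items())

# Toy data: input X uniform on {0,1,2,3}; the output category Y is the parity of X.
X = [0, 1, 2, 3] * 250
Y = [x % 2 for x in X]

T_full = X                   # "learning phase": the representation encodes everything
T_min = [x % 2 for x in X]   # "compression phase": minimal sufficient statistic

print(mutual_information(list(zip(X, T_full))), mutual_information(list(zip(T_full, Y))))  # 2.0 1.0
print(mutual_information(list(zip(X, T_min))), mutual_information(list(zip(T_min, Y))))    # 1.0 1.0
```

Both representations carry the full 1 bit about the output category, but the compressed one has forgotten the irrelevant half of the input information.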

We may learn a lot about how the human brain works from this, but still, as mentioned above regarding the efficiency of learning, the compression phase appears to kick in much earlier in natural neural networks than in artificial ones. Humans seem to be better at extracting archetypes. How can this be?

I believe that the information bottleneck properties observed in artificial deep learning networks describe quite closely the learning phases of newborn babies. Newborn babies are, like untrained AIs, more like a tabula rasa, in the sense that there are few, if any, intrinsic higher levels of abstraction prior to the learning phase. A baby also needs many observations of its mother’s face, her smell and her sound before the higher abstraction level of “mother” is learned, just like an AI would.

But here the natural and the artificial networks deviate from one another. The baby may carry the newly learned concept of a mother as an intrinsic prior for categorization as it goes on to learn who the father is, that food satisfies hunger, and so on. As the child develops, it builds upon an increasing repertoire of prior assumptions, interests, values and motivations. These priors serve as top-down control mechanisms that help the child cope with random or irrelevant variation, speeding up data compression into higher abstraction levels.

My prediction is therefore that compression into archetypal categories, which has been observed in deep learning networks, kicks in much earlier in networks where learning is a combination of bottom-up extrinsic learning and top-down intrinsic control. Hence, by including priors in AI, learning may become much faster.

The next question is how priors may be implemented as intrinsic control systems in AI. This is where the second result, by Pulkit Agrawal et al., comes in as a very important and fundamental shift in AI research. They aimed at constructing a curiosity-driven deep learning algorithm. The important shift here is to train networks to maximize internal, or intrinsic, rewards rather than the extrinsic rewards that have been common so far.

Their approach to building self-learning and curious algorithms is to use the familiar statistical concept of prediction error, a measure of surprise or entropy, as an intrinsic motivation system. In short, the AI agent is rewarded if it manages to seek novelty, that is, unpredictable situations. The idea is that this reward system will motivate curiosity in AI, and their implementation of an AI agent playing the classic game of Super Mario serves as a proof of concept.
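
A minimal sketch of the idea (my own toy version, not the authors’ actual architecture; the state name and learning rate are invented): the agent maintains a forward model of what happens next, and its intrinsic reward is its own prediction error.

```python
# Curiosity as intrinsic reward: a toy forward model over named states.
model = {}  # forward model: state -> predicted next observation

def intrinsic_reward(state, next_obs, lr=0.5):
    pred = model.get(state, 0.0)
    error = (next_obs - pred) ** 2                 # surprise = squared prediction error
    model[state] = pred + lr * (next_obs - pred)   # the forward model improves
    return error

# A deterministic, learnable transition yields rapidly shrinking curiosity,
# so a novelty-seeking agent moves on to states it cannot yet predict.
rewards = [intrinsic_reward("door", 1.0) for _ in range(20)]
print(rewards[0], round(rewards[-1], 6))  # 1.0 0.0
```

Once the world model predicts a transition perfectly, that transition stops paying intrinsic reward, which is exactly what pushes the agent towards the unexplored.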

I believe the researchers at Berkeley are onto something very important for understanding learning in real brains. As I wrote in an earlier blog post, learning is very much about attention, and according to the salience hypothesis, attention is drawn towards surprise. This fits well with the work of Agrawal et al. However, in another blog post I also discussed how attention depends on a mix of extrinsic sensation and intrinsic bias. In a statistical framework we would rephrase this as a mix of input data likelihood and prior beliefs combined into a posterior probability distribution across possible attention points, with our actual point of attention sampled randomly from this posterior distribution.
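
In code, that statistical rephrasing looks like this (the salience and interest numbers are invented for illustration):

```python
import random

random.seed(1)

salience = {"flash": 0.6, "face": 0.3, "text": 0.1}  # extrinsic sensation (likelihood)
interest = {"flash": 0.1, "face": 0.3, "text": 0.6}  # intrinsic bias (prior)

# Posterior over possible attention points: likelihood times prior, normalized.
unnorm = {k: salience[k] * interest[k] for k in salience}
z = sum(unnorm.values())
posterior = {k: v / z for k, v in unnorm.items()}

# The actual point of attention is sampled randomly from this posterior.
point = random.choices(list(posterior), weights=posterior.values())[0]
print(posterior, point)
```

With these numbers the mildly salient but interesting “face” wins the posterior, even though “flash” dominates the raw sensory signal.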

The point here is that prediction error, as a drive for learning, also depends on the internal biases.

These biases are the interests, values, and emotions we all possess that guide our attention, not only towards novelty, but towards novelty within a context that we find interesting, relevant and meaningful.

You and I will most likely have different attention points given the same sensory input due to our different interests and values. These biases actually influence how we perceive the world!

My good friend and colleague, the psychologist Dr. Helge Brovold at the National Centre for Science Recruitment in Trondheim, Norway, states this nicely:

“We don’t observe the world as IT IS. We observe the world as WE ARE”

This has now been confirmed in a recent study by Henderson et al. at the Center for Mind and Brain at the University of California, Davis. Their experiments show that visual attention is indeed drawn towards meaning rather than surprise or novelty alone. This contradicts the salience hypothesis, which, according to Henderson, has been the dominant view in recent years. Human attention is thus guided by top-down intrinsic bias, an inner motivation guided by meaning, interests, values or feelings.

When Agrawal and his colleagues implemented their intrinsic prediction-error (or entropy) driven learning algorithm for the Super Mario playing agent, they encountered exactly this problem: some sort of top-down bias was needed to keep the agent from getting stuck when facing purely random (and hence unpredictable) noise. Noise is a kind of irrelevant novelty and should not attract curiosity and attention. To guide the algorithm away from noise they had to define what was relevant for the learning agent, and they defined as relevant the part of the environment which has the potential to affect the agent directly. In our context we can translate this as the relevant or meaningful side of the environment. However, this is only one way to define relevance! It could just as well be moral standards acting as intrinsic motivation for the learning agent.

At this point we may argue that the inclusion of top-down intrinsic bias, in addition to extrinsic reward systems, in deep learning may both speed up the learning process and open up for AI guided by morals and ethics. Strong ethical prior beliefs may be imposed on the learning network, causing the learning algorithm to compress data around given moral standards.

In my opinion, this is the direction AI must move.

But… there is no such thing as a free lunch…

The introduction of intrinsic motivation and bias comes with a cost. A cost we all know from our own lives. Biases make us subjective.

The more top-down priors are used, and the stronger they are, the more biased learning will be. In the extreme case of maximal bias, sensory input will provide no additional learning effect. The agent will be totally stuck in its intrinsic prejudices. I guess we all know examples of people who stubbornly stick to their beliefs despite hard, contradictory empirical evidence.

On the other hand, the fact that human perception tends to be biased by prior beliefs in this way is an indication that this is indeed how natural learning networks learn… or don’t learn…





The random walk of thought
Sun, 04 Dec 2016 22:07:06 +0000

The post The random walk of thought appeared first on Metacognition.

I’m sure you have experienced the frustration of not recalling a person you meet unexpectedly and who obviously knows you quite well. The name lingers at the tip of your tongue, but is impossible to retrieve from memory, and you struggle not to give away that you are lost. Then the person may say something that gives you a clue, and suddenly associations and connected memories rush through your brain. Small pieces of the puzzle fall into place, and finally you recall the name, just seconds before it starts getting embarrassing. You elegantly and subtly verify that you know who you’re talking to and the crisis has been avoided.

Sounds familiar? I’ve been there! But why was it so hard to look up information I obviously had stored on my hard disk? The answer is that our minds have no page index or table of contents like a regular book. Thinking is based on the principle of association. The next thought follows from the previous, and to recall a memory from the library we need a cue from which we can walk into the memory by following a path of associations. In statistical terms, thinking is a stochastic process, a ‘random walk’ which literally reflects a random walk of signals between neurons in the brain.

Random walk? Are my thoughts just nonsense, then? Of course not. The word ‘random’ has an everyday interpretation which differs from the statistical one. Random just means that the process is not completely deterministic; some outputs from a stochastic process may be much more likely than others. I think a certain cognitive randomness is a necessary condition for the existence of free will, but that is another subject. (For my statistical view on whether we have free will, see my previous post The statistics of free will.)

Let’s return to the associations. If I ask you: What is your first association to the word ‘yellow’?


Maybe you were thinking of the sun, a banana or perhaps a submarine, whereas ‘car’ would probably be less likely (unless you happen to own a yellow car).

The probabilities of moving to other thoughts from the current are called transition probabilities in statistics. My personal transition probability from yellow to submarine is quite high since I’m old enough to remember the Beatles. After thinking ‘submarine’ my continuing random walk of thought could be: Beatles-Lennon-Shot-Dead. Those were in fact my immediate associations. Your thoughts would probably take another random walk.
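
Such a chain of associations is easy to simulate. The transition probabilities below are of course invented; your personal ones would differ:

```python
import random

random.seed(42)

# Hypothetical transition probabilities over a tiny state space of thoughts.
transitions = {
    "yellow":    {"sun": 0.4, "banana": 0.3, "submarine": 0.2, "car": 0.1},
    "sun":       {"summer": 1.0},
    "banana":    {"monkey": 1.0},
    "submarine": {"beatles": 1.0},
    "beatles":   {"lennon": 0.9, "submarine": 0.1},
    "lennon":    {"shot": 0.8, "beatles": 0.2},
    "shot":      {"dead": 1.0},
    "summer":    {"sun": 1.0},
    "monkey":    {"banana": 1.0},
    "dead":      {"yellow": 1.0},
    "car":       {"yellow": 1.0},
}

def random_walk(start, steps):
    """The next thought is sampled given only the current one (the Markov property)."""
    path = [start]
    for _ in range(steps):
        options = transitions[path[-1]]
        path.append(random.choices(list(options), weights=options.values())[0])
    return path

print(random_walk("yellow", 6))
```

Each run of the walk gives a different chain of associations, but chains passing through "submarine" are very likely to continue Beatles-Lennon-Shot-Dead.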

Stochastic processes are well studied in statistics, and it may be worthwhile to draw some connections between what we have learned from statistical research and cognitive processes like thinking and conversation. Such a comparison may give us new meta-cognitive perspectives on thinking, conversation, personality and psychopathological conditions like obsessive-compulsive disorder (OCD), attention deficit hyperactivity disorder (ADHD) and Alzheimer’s disease (AD). In this blog post I will look at the properties of some specific stochastic processes, known as Markov processes and hidden Markov processes, in this cognitive context.


Let’s start by assuming that at any given (and conscious) moment our thoughts are sampled from a fixed repertoire of potential thoughts and memories, and that we are not influenced by external factors. We may call this thought repertoire the ‘state space’ of thought. The likelihood of the various cognitive outcomes from this state space depends on our history of experiences, the situation we’re in at the moment (the context), our interests and values, our focus level (high or low), and our personality traits. But it also depends on the current thought as a primer for the next. Together these factors define a distribution of transition probabilities over the state space of thought, and from this distribution we sample what to think next.


Markov processes

The random sampling of thoughts is very much like the random sampling of parameters from a candidate distribution, as used in Markov chain Monte Carlo (MCMC) estimation in Bayesian statistical inference. In MCMC, a random walk process is initiated and sampling is run for a long time in order to estimate an unknown probability distribution from the sampled parameter values. The cognitive translation is as follows: by monitoring the random walk of thought of a person for a long time and recording the thoughts, we could estimate the likelihood of all thoughts in the state space. If the random process behaves properly, the estimate will be independent of the initial thought and context. We might also obtain estimates of other general (yet personal) quantities, like the most typical thought and the variability of thoughts (some are more narrow-minded than others). Of course, such thought-monitoring is not possible unless you are monitoring yourself. The thinking process of any other person is in that respect an example of what in statistics is known as a ‘hidden stochastic process’: an output from the hidden process is only observed now and then, as the person speaks. I will come back to this later.
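
A small simulation shows the idea. Below, a two-thought chain (with made-up probabilities) is monitored for a long time, and the visit frequencies estimate its long-run distribution, which for this chain can also be computed analytically as 2/3 and 1/3:

```python
import random
from collections import Counter

random.seed(7)

P = {"work":   {"work": 0.7, "coffee": 0.3},
     "coffee": {"work": 0.6, "coffee": 0.4}}

state, visits, n = "work", Counter(), 100_000
for _ in range(n):
    visits[state] += 1
    options = P[state]
    state = random.choices(list(options), weights=options.values())[0]

# Monitoring the walk long enough estimates the stationary distribution,
# independently of the starting thought (analytically: work 2/3, coffee 1/3).
freq = {s: c / n for s, c in visits.items()}
print({s: round(f, 3) for s, f in freq.items()})
```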

A Markov process (after the Russian mathematician Andrey Markov) is a stochastic process where the probability of entering the next state given the entire history of previous states is the same as the probability of the next state given only the current state. This is the Markov property. If we assume a cognitive Markov process, the probabilities of my next thoughts depend only on the current thought, not on how I got there from previous thoughts. That is, if I for some reason got to think of the Beatles by another route than via yellow submarines, I would still be likely to think Lennon-Shot-Dead as the next sequence.

Whether cognitive processes satisfy the Markov property is perhaps questionable, but let’s stick to this assumption for simplicity since Markov processes and MCMC methods in statistical inference have many interesting properties which I think are relevant also for thinking, learning and neurological disorders.

So let us have a cognitive look at some of these properties.

Priming – Initial value dependence

In order to start a Markov process, an initial value must be given. This initial value is the ‘primer’ or anchor for the next thought. The effect of priming is well known and has been studied in many psychological experiments. Priming describes how thoughts and decisions may be biased towards input cues. The cues may be more or less random, or given deliberately to manipulate the cognitive response of the receiver. Priming is a widely used technique in commercial marketing, where subtle messages are given to bias our opinions about products and increase the likelihood of us buying them, and social media marketing now delivers highly personal primers based on the information we provide online. In teaching, priming of thoughts through so-called flagging of headlines is a recommended trick to prepare the minds of the listeners before serving the details. A Markov chain will forget its initial value after some time, and the effect of priming in psychology is similarly limited in time.

Focus level – Random walk step length and mixing level

For random walk processes the center of the proposal distribution is typically the current value, but another important factor is the step length, or variance, of the distribution. If step lengths are short, the process moves very slowly across the state space, only entering closely connected states. Furthermore, the series of visited states will show a high level of autocorrelation, which in the cognitive setting means that consecutive thoughts tend to be similar and related. One might characterize a person with highly autocorrelated thinking as narrow-minded, but we all tend to be narrow-minded whenever we focus strongly on solving some difficult task or concentrate on learning a new skill. Neurologically, strong focus is induced by activation of inhibitory neurons through increased release of the neurotransmitter GABA (gamma-aminobutyric acid), which reduces the transition probabilities for long jumps to irrelevant thoughts.

The problem with a slowly moving cognitive chain like this is the high likelihood of missing out on creative solutions to problems. If step lengths are allowed to increase (by reduction of inhibitory neuron activity), a more diffuse state of mind is induced, which facilitates creative thinking. However, too long step lengths increase the risk that very remote ideas pop up, only to be rejected as irrelevant in the current context. For long step lengths autocorrelation may be very low, and thoughts appear disconnected. Some persons suffering from attention deficit hyperactivity disorder (ADHD) may lack the ability to retain focus over long time spans due to having random walks of thought with too-long steps. In statistical inference and MCMC estimation of an unknown probability distribution, a so-called good mixing process is desirable, where the chain moves across the state space in intermediate step lengths, avoiding being both too narrow-minded and too diffuse. Such good mixing processes have the highest probability of covering the state space in sufficient proportions within a limited time span. For cognitive processes, the definition of good mixing will of course depend on the context, and on whether focusing or defocusing is most beneficial.
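
A Metropolis sampler makes this concrete. The sketch below is a standard toy setup (a normal distribution standing in for the state space of thought, with invented step lengths), measuring lag-1 autocorrelation for short, intermediate and very long steps:

```python
import math
import random

random.seed(3)

def metropolis(step, n=20_000):
    """Metropolis sampler over a one-dimensional 'state space of thought'
    with a standard normal target: a caricature of associative wandering."""
    x, xs = 0.0, []
    for _ in range(n):
        prop = x + random.uniform(-step, step)            # proposed next thought
        if random.random() < math.exp((x * x - prop * prop) / 2):
            x = prop                                      # association accepted
        xs.append(x)
    return xs

def lag1_autocorr(xs):
    n, m = len(xs), sum(xs) / len(xs)
    var = sum((v - m) ** 2 for v in xs) / n
    return sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1)) / n / var

a_short, a_mid, a_long = (lag1_autocorr(metropolis(s)) for s in (0.1, 2.5, 50.0))
# Short steps: near-perfect correlation (narrow-minded). Very long steps: most
# proposals are rejected, so the chain also stalls. Intermediate: good mixing.
print(round(a_short, 3), round(a_mid, 3), round(a_long, 3))
```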

Thought suppression and Alzheimer’s – State space reducibility

A state space is irreducible if all states are accessible from any other state within a limited number of steps. If we for simplicity assume a static state space of thoughts and memories, it will be irreducible if any thought or memory can be reached by association from any other thought or memory. Of course our cognitive state space is not static, but reducibility of mind may occur in cases where memories are unconsciously suppressed and never reached consciously (functional reducibility), or where connections to memories are lost due to damaged synapses (structural reducibility), as may happen in Alzheimer’s disease.
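
Reducibility is easy to check if we model the associative links as a directed graph. The memories and links below are hypothetical; a suppressed or damaged link is simply an absent edge:

```python
from collections import deque

# Associative links as a directed graph (invented example).
links = {
    "home":   ["mother", "garden"],
    "mother": ["home", "song"],
    "garden": ["home"],
    "song":   ["mother"],
    "war":    ["war"],   # a suppressed memory: no other thought links to it
}

def reachable(start):
    """All thoughts reachable from `start` by following associations (BFS)."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in links[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# The state space is irreducible iff every thought reaches every other.
irreducible = all(reachable(s) == set(links) for s in links)
print(irreducible)  # False: "war" can never be reached by association
```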

Obsessive disorders – Periodicity of random walk

A Markov chain is periodic with period k if any return to a given state must occur in a multiple of k steps. A chain is said to be aperiodic if k=1. Aperiodic chains are usually desirable in MCMC, but in natural processes periodicity may occur. It is reasonable to assume that cognitive processes are aperiodic, although some cognitive impairments, like obsessive-compulsive disorder (OCD), may show temporary periodicity, where the patient does not seem able to snap out of a circular chain of thoughts.
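
A toy obsessive loop illustrates the definition (deterministic for simplicity; real cognition would at most be approximately periodic): every return to the starting thought takes an even number of steps, so the period, the gcd of the return times, is 2.

```python
from math import gcd

# A deterministic circular chain of thoughts: check -> worry -> check -> ...
transitions = {"check": "worry", "worry": "check"}

def return_times(start, max_steps=10):
    """Steps at which the chain is back in its starting state."""
    times, state = [], start
    for t in range(1, max_steps + 1):
        state = transitions[state]
        if state == start:
            times.append(t)
    return times

times = return_times("check")
period = gcd(*times)  # math.gcd accepts multiple arguments from Python 3.9
print(times, period)  # [2, 4, 6, 8, 10] 2
```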

Context dependent associations – Time-inhomogeneity

There might be times at which “submarine” would be an even more probable association from “yellow” than at other times, for instance immediately after hearing the Beatles’ tune on the radio. This means that the transition probabilities are time-inhomogeneous. Time-inhomogeneous Markov chains are often used in MCMC estimation when step lengths (focus) are allowed to vary over time to optimize chain mixing. The inhomogeneity in cognitive processes is not only a means to adjust focus level (mixing); the transition probabilities will also depend on context, place and mood.

Plasticity and learning – Non-stationarity

So far we have for simplicity assumed that there is a fixed probability distribution across the state space of thought for an individual, and that the state space itself is static. This is characteristic of a stationary distribution in stochastic process theory. There is, however, no reason to believe that the cognitive state space is static, nor that the thought distribution is stationary, because we are all expanding and altering the state space through learning, and the brain is continuously changing, both functionally and structurally. The Markov chain of thinking actually changes the probability distribution over the state space as it moves, because repetitively running chains of association is a key part of learning, by which the transition probabilities are altered through changes in the synaptic strengths of the association networks. Furthermore, new and previously unvisited thoughts occur during the random walk as a result of creative thinking or learning from external input, and the very fact that the process visits these thoughts increases the likelihood of a later revisit. Finally, non-visited parts of the state space may be eliminated (forgotten) through pruning of synaptic connections. The brain is very plastic, and hence so is the state space of thought. The random walk of learning and creative thinking may therefore be considered a non-stationary stochastic process. If you think about it, this should be obvious: during our lifetime our interests, values and contexts change, and this certainly reflects in our thought processes.
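
A minimal sketch of such a self-modifying chain (the reinforcement rule, states and weights are invented): every traversed association is strengthened, so the walk reshapes its own transition distribution, much like repetition strengthens synapses.

```python
import random

random.seed(5)

weights = {"idea":  {"math": 1.0, "music": 1.0},
           "math":  {"idea": 1.0},
           "music": {"idea": 1.0}}

state = "idea"
for _ in range(200):
    options = weights[state]
    nxt = random.choices(list(options), weights=options.values())[0]
    weights[state][nxt] += 0.1   # learning: the visited path becomes more likely
    state = nxt

# The distribution over next thoughts from "idea" has drifted with experience.
total = sum(weights["idea"].values())
print({k: round(v / total, 3) for k, v in weights["idea"].items()})
```

This rich-get-richer dynamic is non-stationary by construction: the transition probabilities at step 200 are not the ones the walk started with.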

Thinking and conversation – Hidden Markov Chains

As mentioned previously, to an outsider my thought process is hidden. In statistical inference, hidden Markov chains are used to model data where it is reasonable to assume an underlying stochastic process generating occasional observable output. My thoughts are occasionally observable whenever I speak. Hidden Markov chains are defined not only by transition probabilities for the hidden state space, but also by state-dependent probabilities of generating an observable output. Again, these output probabilities may depend on, for instance, context and personality. If I am on unfamiliar ground, either literally or cognitively, I am less likely to express my thoughts. Furthermore, I am an introvert, and introverts are less expressive than extroverts. The cognitive process of an introvert is generally more hidden, with smaller probabilities of generating output than is the case for an extrovert. This also implies that the outputs of an introvert may seem more disconnected, with low autocorrelation. Extroverts’ statements may, on the other hand, appear to have higher autocorrelation than those of introverts, and the latter group may easily get annoyed by extroverts saying the obvious and sticking too long with a topic.
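
A small simulation (with invented transition and emission probabilities, and emission made state-independent for simplicity) shows the observable difference between an introvert’s and an extrovert’s hidden chain:

```python
import random

random.seed(11)

transitions = {"food":   {"hunger": 0.5, "food": 0.5},
               "hunger": {"food": 0.7, "hunger": 0.3}}

def spoken_fraction(p_speak, steps=10_000):
    """Run the hidden chain; each thought becomes an observable utterance
    with probability p_speak."""
    state, spoken = "food", 0
    for _ in range(steps):
        options = transitions[state]
        state = random.choices(list(options), weights=options.values())[0]
        if random.random() < p_speak:   # emission: the thought is voiced
            spoken += 1
    return spoken / steps

introvert, extrovert = spoken_fraction(0.1), spoken_fraction(0.6)
print(round(introvert, 3), round(extrovert, 3))
```

The introvert’s sparse outputs are samples taken far apart along the hidden chain, which is why they can appear disconnected to a listener.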

Collaboration – Parallel chains

A common trick for monitoring whether Markov chains have converged to their stationary distribution, are mixing well and have forgotten their initial values, is to initiate several parallel chains with different initial values spread out across the state space. Comparing within-chain and between-chain variability tells us whether the mixing works properly and convergence has been reached. Furthermore, parallel chains may cover the state space faster, and integrating the information from all chains yields quicker estimates of the properties of the state space distribution.
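
The within- versus between-chain comparison can be sketched as follows (a simplified version of the usual convergence diagnostics, with a standard normal target standing in for the state space):

```python
import math
import random

random.seed(13)

def chain(start, n=5_000):
    """Gaussian random-walk Metropolis chain targeting a standard normal."""
    x, xs = start, []
    for _ in range(n):
        prop = x + random.gauss(0.0, 1.0)
        if random.random() < math.exp((x * x - prop * prop) / 2):
            x = prop
        xs.append(x)
    return xs

# Parallel chains with initial values spread across the state space.
chains = [chain(s) for s in (-10.0, 0.0, 10.0)]

means = [sum(c) / len(c) for c in chains]
grand = sum(means) / len(means)
between = sum((m - grand) ** 2 for m in means) / (len(means) - 1)
within = sum(sum((v - m) ** 2 for v in c) / (len(c) - 1)
             for c, m in zip(chains, means)) / len(chains)

# Small between-chain variability relative to within-chain variability
# indicates the chains have converged and forgotten their initial values.
print(round(between, 4), round(within, 4))
```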

Parallel and hidden Markov chains interact in the context of a conversation at the lunch table, or during group-based learning. The flipped classroom is an example of group learning in which students watch lecture videos at home, preparing for group-based learning and problem solving at school. The teacher operates more like a guide and discussant than a lecturer as he or she visits the groups. The homework prepares the students for the group learning process, and each group member joins the collaboration with their own hidden thought process and individual initial values. In addition, varying experience and knowledge levels, interests, values and personalities yield individual cognitive state space distributions. During the group process, the parallel hidden random walks of thought evolve jointly towards a better understanding of the subject to be learned. Through conversation, associations are exchanged, which may lead to jumps in the hidden processes. These jumps can result in better coverage of the state space and faster learning for each individual group member. The integration of information from multiple hidden parallel chains becomes effective through conversation and collaboration. Here the students’ personalities may influence the effectiveness of the learning process. Introverts have, as discussed above, smaller probabilities of generating outputs from their hidden chains than extroverts. Extroverts may therefore get a correction of direction more quickly through interaction with other extroverts, and this again may lead to faster convergence of thoughts than for introverts, who may get stuck sub-sampling limited thought regions for a long time.

Creativity – The parallel hidden chains of unconscious associations

Earlier I wrote that long association step lengths increase creativity, but the truly creative driver is probably the hidden and parallel processes of the unconscious mind. There is neuronal activity even in a resting brain, and in brain regions that are not monitored by our consciousness: not only the sub-cortical regions and the cerebellum, but also cortical regions outside the focus area of our consciousness (see my previous post …). Even if the signaling processes in these regions are hidden to us, they are likely to walk along the paths of highest transition probabilities. Furthermore, the unconscious random walks of associations are not restricted by having to be followed by our conscious attention (which is univariate). Hence, there may be multiple parallel hidden chains running in the unconscious. This may explain why the unconscious is such an effective problem solver and generator of creative thought. Sometimes the hidden processes produce coherent sampling which induces conscious attention by generating an output observable to our consciousness: an a-ha moment. (This very idea was in fact served to me by my unconscious right before going to sleep, after writing the introduction to this post.) How the unconscious processes contribute to conscious experience and attention was among the topics of my previous post The statistics of effective learning.

In this post I have presented some similarities between statistics and cognition, and once more it seems that nature thought of it first. Statistical knowledge may, however, give us new insight into and understanding of cognitive processes, as discussed here.



The statistics of effective learning
Thu, 25 Aug 2016 09:18:17 +0000

The post The statistics of effective learning appeared first on Metacognition.

Throughout your days in school, and maybe also in college and at the university, you have had a lot of different teachers and lecturers. Most likely you also have a favorite among these, a really good teacher who managed to completely capture your attention and focus. I once had a math teacher to whom the entire class paid close attention, in every lesson, but how did he do it? I guess he just had the talent for it, but what is the secret behind a really good lecture or an excellent presentation?

A place to start is to learn from experienced lecturers, and if you want to create and give great presentations, there is a lot of pedagogical advice out there, from textbooks in teacher education to dedicated presentation services on the internet. Here is a short list of some of the advice I found on a few such websites, organized into four main points.

  • Make structured presentations, minimize text on slides
  • Use storytelling, relevant visuals and other multi-media sources, move your hands
  • Speak in short bursts and vary your vocal level, include elements of surprise
  • Dress properly, smile and make eye contact

As a quite experienced lecturer in statistics at the Norwegian University of Life Sciences I am familiar with this advice, and although I still have a lot to learn, I have found that it does help to hold the students' attention. But why does it work? In this post I will address aspects of statistics, information theory, consciousness, attention and neuroscience which may help answer this question.

Measuring information

First of all, it is of course a matter of signal and noise. That is what statistics is all about, and so is learning. From varying, and at first apparently chaotic, input, learning takes place when we discover some new order, a higher level of categorization, which can explain part of the observed variation. It is my challenge as an educator to help my students bring order into chaos.

As I teach I provide input signals to my students. The perceived information content of an input signal may be measured or represented in different ways. In statistics we often use a signal-to-noise ratio, like the F test statistic in Analysis of Variance (ANOVA). At any given time we are in possession of beliefs and theories about the outer world. If our belief about the world that generates incoming signals is plausible, the world appears to some extent predictable. Our inner model is good, and the unexplained part of what we observe (the noise) is small, which results in a high signal-to-noise ratio. On the other hand, if the input appears surprising and unpredictable, we still have something to learn about the processes that produce it. Our inner belief model is poor, and a large part of what we observe appears as noise. According to Hohwy (2013), life (and learning) is a continuous process of prediction-error minimization through model (belief) adjustments.
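To make the signal-to-noise idea concrete, here is a minimal sketch (with made-up data) of the one-way ANOVA F statistic: variance explained by group differences, the "signal", divided by the residual variance within groups, the "noise".

```python
# A minimal illustration (invented data) of the signal-to-noise idea behind
# the ANOVA F statistic: between-group variance over within-group variance.

def f_statistic(groups):
    """One-way ANOVA F statistic; `groups` is a list of lists of observations."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = len(groups) - 1, n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

clear_signal = [[1.0, 1.1, 0.9], [5.0, 5.2, 4.8]]  # well-separated groups
mostly_noise = [[1.0, 5.0, 3.0], [2.0, 4.0, 3.1]]  # overlapping groups
print(f_statistic(clear_signal) > f_statistic(mostly_noise))  # True
```

When the groups are well separated relative to their internal spread, the F ratio is large: a good inner model explains most of the variation, and little remains as noise.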

Another measure of information content is mutual information: the amount of information shared between two variables, such as the input from the world that we receive through our senses on one hand, and our predictions of the world generated by our current beliefs on the other. Formally, it is the reduction in entropy (degree of disorder) of one variable obtained by knowing the other. If the two variables agree closely, the mutual information is high. Mutual information is linked to prediction error in that the mutual information increases as the prediction error decreases.
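A rough sketch of this measure for two discrete variables, using the standard formula I(X;Y) = Σ p(x,y) · log2( p(x,y) / (p(x)·p(y)) ). The "world" and "prediction" sequences below are invented examples, not data from the post:

```python
# Mutual information between two discrete sequences, estimated from counts.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    n = len(xs)
    pxy = Counter(zip(xs, ys))           # joint counts
    px, py = Counter(xs), Counter(ys)    # marginal counts
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

world      = [0, 1, 0, 1, 0, 1, 0, 1]   # signals arriving from the world
good_model = [0, 1, 0, 1, 0, 1, 0, 1]   # predictions that match the input
poor_model = [0, 0, 1, 1, 0, 0, 1, 1]   # predictions unrelated to the input
print(mutual_information(world, good_model))  # 1.0 (bit)
print(mutual_information(world, poor_model))  # 0.0
```

A perfect inner model shares one full bit with the binary input; an unrelated model shares nothing, mirroring the link between low prediction error and high mutual information.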

Chaos and order

The more we learn, the better our inner models of the world become, and the better we can predict. We are simply less surprised by what we observe! The mutual information between the content of my lecture and the students' understanding of it increases as order is brought into chaos. As a lecturer I must help my students adjust their inner prediction models in such a way that they steadily become less surprised by what I say. When most of what I say appears predictable to them, the students are ready for the exam!

The goal is order and stability (predictability), but as Van Eenwyk (1997) states:

“.., whenever stability is a goal of adaption, chaos’s contribution rivals that of order. From that perspective, order and chaos reflect one another. Just where the line between the two exists depends largely on our ability to recognize similarities in the chaos.”

Hence, a good learning process shifts constantly between chaos and order: new chaos followed by new order, and so on. If I as a lecturer do not manage to create new chaos, my presentation becomes boring, fully predictable and unnecessary. There is no new order to be established. A good presentation should therefore strive to be somewhat unpredictable, because it is the discrepancy between the students' beliefs and what I actually say, the surprise, which carries the potential for new learning.

Conditions for effective learning

I just mentioned attention, which of course is a necessary condition for learning, and another necessary condition is consciousness. A main message in this blog post is the following:

  • A good learning process should raise the level of consciousness and trigger focused attention.

Sounds simple enough, and the list of good advice presented earlier may help accomplish this. Let me first elaborate a bit on consciousness and attention before we return to that advice.


Even though psychologists have shown that we may pay attention even when consciousness is low, and conversely that we can be quite conscious without paying attention (Hohwy, 2013), it is the combination of high consciousness AND high attention which is required for optimal learning conditions.

Consciousness, what is it really? We all have first-hand experience of it, yet it is so hard to explain. New theories have been suggested, among which I find the integrated information theory (IIT) of Tononi (2004) and co-workers the most promising. In short, the theory is based on the level of integrated information inherent in a neural network. The model explains how a central part of a network, for instance the frontal cortical region, may rise above some imagined critical level of integrated information, which is believed to give rise to consciousness as an emergent property of the network.

[An emergent property is according to the Stanford Encyclopedia of Philosophy a state that “‘arise’ out of more fundamental entities and yet are ‘novel’ or ‘irreducible’ with respect to them.” In our case we might say that consciousness is a state which cannot be predicted from observing the reduced properties of the neurons and the neural network.]

What is really interesting about the IIT of consciousness is that the level of consciousness in a central cortical network may depend on, and even be strengthened by, connected sub-networks, such as sub-cortical circuits or cortico-cerebellar pathways, which remain unconscious as long as they are sufficiently segregated (loosely connected). This can explain why we are not conscious of processes controlled by, for instance, the brainstem (e.g. heartbeat, breathing, body temperature control) and the cerebellum (like automated body movements). Neither are we directly conscious (luckily) of the immense data processing and filtering going on in the sensory cortical regions like the visual cortex. However, the filtered information from the unconscious regions is gathered, integrated and potentially bound together in the central, "conscious part" of the neural network.

We may envision the conscious part of the network as the part of an iceberg that is visible above sea level, and the unconscious regions as the large mass of ice below the surface supporting the conscious experience. The sea level is the critical level of integrated information required for consciousness to occur.

The iceberg above sea level represents the potential focal points that can be sampled by our conscious attention, as I also wrote about in the post "Your random attention". Further, as explained by Hohwy (2013), we tend to choose an attention point which is surprising to us, that is, one with a high prediction error.

Just think about it, imagine that you sit at a lecture listening to a not so engaging presentation about a well known subject. If something unexpected suddenly happens, like a cell phone ringing, your attention will immediately be drawn towards the surprising element.

So, here comes the element of surprise again. If a lecture is somewhat unpredictable, it will help to draw attention to it. But there is even more to this story. According to Friston (2009) attention is drawn strongest to a focal point with a combination of high prediction error (surprise) AND expected high information quality (precision).

Thus, our attention is drawn towards objects or actions that represent a high level of surprise, and which we in addition find to be reliably observed or to come from a trusted source. For example, we are more likely to pay attention when a highly respected person, rather than an ordinary one, takes the stand to give a speech.
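As an illustrative sketch of this idea (my own simplified formulation, not Friston's actual equations): attention goes to the focal point with the largest precision-weighted prediction error, where precision is the inverse variance of the expected sensory noise. The focal points below are invented examples.

```python
# Attention as precision-weighted surprise: salience = precision * |error|,
# with precision = 1 / variance of the expected noise (an assumed toy model).

def most_salient(focal_points):
    """focal_points: list of (label, predicted, observed, noise_sd) tuples."""
    def salience(p):
        _, predicted, observed, noise_sd = p
        precision = 1.0 / noise_sd ** 2               # reliable signals weigh more
        return precision * abs(observed - predicted)  # weighted surprise
    return max(focal_points, key=salience)[0]

points = [
    ("droning lecturer", 1.0, 1.1, 0.5),  # predictable: tiny prediction error
    ("ringing phone",    0.0, 3.0, 0.5),  # surprising AND a reliable signal
    ("hallway noise",    0.0, 3.0, 5.0),  # surprising but expected to be noisy
]
print(most_salient(points))  # ringing phone
```

The phone wins even though the hallway noise is equally surprising, because its signal is expected to be far more precise.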

Varying levels of consciousness

I guess you have seen, either with your own eyes or in pictures, icebergs of different shapes. Some are peaked, reaching relatively high above the surface, whereas others are flat, barely visible above water. The first kind of iceberg may illustrate a highly conscious state where the unconscious inputs are congruent and can be integrated into a strong, unified conscious state of mind.

Maybe you have some vivid memories from your youth or childhood? I also have some, and when I think back now, I can describe the given moment vividly, both what I saw, heard, and felt, and maybe also smell or taste was involved. Chances are also that such vivid memory experiences triggered some emotions, either good or bad. I think the reason I remember these moments so strongly is the fact that all unconscious inputs to the information integration network were congruent and could strengthen the conscious experience by adding different dimensions to it. My brain found the strongest focal point to attend to and the moment was glued to my memory!

The other type, the flat icebergs, represents the opposite. We are awake and conscious, but the input signals are either weak, noisy and unreliable, or non-congruent and distracting. Think about the non-engaging lecture again… Attention points are hard to find or are very unstable on this flat surface. The conscious experience is weak and you are on the border of unconsciousness, maybe half asleep or daydreaming. This state of mind is good for thought wandering, creativity and free association, but not optimal for learning "facts" from external stimuli.

Let us sum up what we have so far. In order to create a good learning experience through instruction a lecturer should:

  1. Increase attention: Keep an element of surprise, avoid being predictable and monotone, and be precise and trustworthy.
  2. Increase consciousness: Be congruent in the way that all inputs to the students add to a unitary, integrable experience.

Building a unitary, focused learning experience

This latter point is very important, but under-communicated. Sadly, I have seen so many presentations, especially at scientific conferences, where slides crowded with text are flashed before the eyes of the audience while the speaker tries, with good intentions, to express the same message in other(!) words, perhaps with non-engaging body language on top. What really happens is that the speaker conveys simultaneous but non-congruent messages, one visual-lingual and another auditory-lingual. The brains of the audience then fail to integrate the information, the signal-to-noise ratio becomes low, and the speaker's body language reduces the signal content even further. The top of the iceberg is flat, and the audience enters a state close to unconsciousness. Some in the back row may even reach it entirely…

A small side comment fits well here, because simultaneous non-congruent input is in fact also a type of communication, but one that has been used for inducing hypnosis(!) by professional psychiatrists, like the famous Milton H. Erickson (Grinder and Bandler, 1981). A hypnotic state may be induced by simultaneously over-loading multiple senses for some time, until the hypnotist offers the patient an escape from the distracting and exhausting condition: entering unconsciousness.

This is NOT what I would like my students to experience! Still, I fear that many lectures given at universities have similar hypnotic effects in the way students are over-loaded with non-congruent and non-integrable information!

Body language and voice

I mentioned non-engaging body language as a distraction in the example above, but "move your hands" was on the list of good advice, and hand movements and a rich body language may indeed increase attention to the presentation if used properly.

I recently came across a post on the TED blog by Alison Prato. The subheading of the blog was "All TED Talks are good. Why do only some go viral?" There Vanessa Van Edwards, the founder of the consultancy Science of People, gives some clues as to why some TED talks become more popular than others. In their study, hundreds of TED talks were rated by 760 volunteers, and it is interesting to see how well the results fit the points I have made here.

Firstly, the most popular lecturers were found by the test panel to be more credible and to have higher vocal variety. Edwards says:

“We found that the more vocal variety a speaker had, the higher their charisma and credibility ratings were. Something about vocal variety links to charisma and competence.”

Secondly, the lecturers that go viral had a richer body language than others. Edwards says, that this was a bit surprising: “We don’t know why, but we have a hypothesis: If you’re watching a talk and someone’s moving their hands, it gives your mind something else to do in addition to listening. So you’re doubly engaged. For the talks where someone is not moving their hands a lot, it’s almost like there’s less brain engagement, and the brain is like, “This is not exciting” — even if the content’s really good.”

I think they are onto something here, but according to my hypothesis, the body language must not only be rich, but it must also be congruent with the spoken word. Random hand-waving is distracting and lowers consciousness. Also, high vocal variety may be distracting and lower the impression of competence if used wrongly.

You can read the entire TED-blog by Alison Prato here.

The salience network for attention and prediction error minimization

Thus far my arguments for effective learning conditions have been quite theoretical and based on information theory, but there is also neuroscientific support. In recent years neuroscience has shifted from a quite modular view of the brain, where functions (and malfunctions) are tightly connected to specific brain regions, to a view where brain functions result from network processes and communication between regions. A series of functions and pathologies have lately been connected to networks and their properties (see e.g. Sporns, 2010), among which attention has been connected to the so-called salience network (Menon, 2015). This is a frontal-lobe-centered network which includes brain regions like the dorsal anterior cingulate cortex (dACC), the anterior insula (AI), the amygdala (A) and the ventral striatum (VS). Menon (2015) describes this network as

“…a high-order system for competitive, context specific, stimulus selection and for focusing ‘spotlight of attention’ and enhancing access to resources needed for goal-directed behavior”.

The network is connected to the sensory cortices for external inputs, to the AI for self-awareness, to the emotion center of the A, and to context evaluation in the VS. The dACC is believed to be the center for evaluating surprise, or prediction error. This means that the network carries all the constituents needed to serve as a conscious attention network. Combined with the theory of Tononi, a highly conscious state allowing strong attention may be induced if all input information is congruent; and if the input also fits emotions and interests and feels relevant, even better.

Advice explained

It is time to return to the list of good advice for presenters. Based on the theories of consciousness, attention and prediction-error minimization we have just worked through, we should be able to explain why this advice works.

Here is the list again, with explanations of why each piece of advice improves learning:

  • Make structured presentations, minimize text on slides
    • Brings order into chaos, helps categorize
    • Increases signal-to-noise ratio and mutual information
    • Increases consciousness and attention
  • Use storytelling, relevant visuals and other multi-media sources, move your hands
    • If done properly, adds extra dimensions to the learning experience
    • Congruent sensory input
    • Unitary and integrable experience
    • Increased consciousness and attention
  • Speak in short bursts and vary your vocal level, include elements of surprise
    • Increasing prediction error among the listeners
    • Increased consciousness and attention
  • Dress properly, smile and make eye contact
    • Increases trust
    • Increases expected information quality (precision)
    • Increased consciousness and attention

Personalized learning

However, it may not be as simple as this. If you ask people in the audience after a lecture whether they liked a presentation or not, you may receive quite different answers. This may be because we are all born with quite personal brain networks. The connections and their strengths in the network differ from one person to another, and we all integrate information in our own personal manner. Some may, for instance, place more emphasis on the emotional part via strong connections to the amygdala as they integrate the incoming information in the salience network, than others do.

I will not spend time on personality theory here, but the link to Walter Lowen (1982) and his entropy-based model of personality, and even consciousness, is apparent, and I may draw that link in a later blog post. Here I will only remark that effective learning probably depends on the actual topology of the attention network of each and every student.


I’m very grateful to my good friend and colleague Dr. Helge Brovold at the National Centre for Science Recruitment in Norway for our many good chats about these topics and for his willingness to share his knowledge and interesting books with me.


Friston, K. (2009). The free-energy principle: a rough guide to the brain?. Trends in cognitive sciences, 13(7), 293-301.

Grinder, J. and Bandler, R. (1981). Trance-formations: Neuro-linguistic programming and the structure of hypnosis. Real People Pr.

Hohwy, J. (2013). The predictive mind. OUP Oxford.

Lowen, W. (1982). Dichotomies of the Mind. Wiley & Sons, NYC

Menon, V. (2015). Salience Network. Brain Mapping: An Encyclopedic Reference, Academic Press: Elsevier.

Sporns, O. (2010). Networks of the Brain. MIT press.

Tononi, G. (2004) An information integration theory of consciousness. BMC Neuroscience, 5:42.

Van Eenwyk, J.R (1997). Archetypes and Strange Attractors. Inner City Books, Toronto, Canada.





A mathematical view on personality Thu, 10 Mar 2016 12:19:43 +0000

The post A mathematical view on personality appeared first on Metacognition.

Both personality and consciousness were properties of the human psyche that were central to the thinking of Carl Jung. He observed the stability of certain personality types among people, but also their complexity and unpredictability. He defined the archetypes as these stabilities; the complexity and unpredictability he attributed to the constant interplay between the conscious self and what he termed "the shadow side" of the personality.


(Carl Jung, by PsychArt/ CC BY / Desaturated from original)

The ideas of Jung have been debated, and other personality traits than his archetypes find broader support today, but new support for Jung may now come from a mathematical perspective.

I will give a short glimpse of a mathematical hypothesis for personality here that I intend to elaborate on elsewhere later.

Within computational neuroscience mathematical theories have contributed a lot to increase our understanding of how brains work, not only at a neuronal level, but also at network level, for instance, how memories are stored and recalled, and how we associate and make decisions (See e.g. Rolls and Deco, 2010). Interestingly, more diffuse properties of the human psyche, like personality (Van Eenwyk, 1997) and consciousness (Tononi, 2004) may, in fact, be connected to mathematical properties of networks, and in this post I will focus on what mathematics can teach us about these matters.

In mathematics there are complex models for information transfer across networks called attractor networks, and the neural network of our brain appears to be well approximated by these models. I have already touched upon these attractors in a previous post on creativity, because creativity is linked to the ability to easily move from one attractor state to another along new or alternative paths.

Attractor networks are built from nodes (for example neurons) which typically are recurrently linked (loops) by edges (like synaptic connections), and the dynamics of the network tend to stabilize, at least locally, into certain patterns. These stable patterns are the attractors. For example, a memory stored in long-term memory may be considered a so-called point attractor: a subnetwork of strongly connected neurons.

The point attractors are low-energy states in an energy landscape with surrounding basins of attraction, much like hillsides surrounding the bottom of a valley, as shown in the figure below. Random perception signals are like rainfall finding its way down to the closest attractor, leading to a thought, a memory recall, an association or a decision to react.


(Point attractors in a network, by Eliasmith / CC BY )
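A minimal Hopfield-style network makes the point-attractor picture concrete. This is an illustration of the general technique, not code from the post: one pattern is stored in Hebbian connection weights, and a corrupted cue then "rolls down" into the stored memory's basin of attraction.

```python
# A tiny Hopfield-style point attractor: store one +1/-1 pattern, then recall
# it from a noisy cue (illustrative sketch with an invented memory pattern).

def store(pattern):
    """Hebbian weights for one pattern (no self-connections)."""
    n = len(pattern)
    return [[0 if i == j else pattern[i] * pattern[j] for j in range(n)]
            for i in range(n)]

def recall(weights, cue, sweeps=5):
    """Repeatedly align each neuron with its weighted input until stable."""
    state = list(cue)
    n = len(state)
    for _ in range(sweeps):
        for i in range(n):
            h = sum(weights[i][j] * state[j] for j in range(n))
            state[i] = 1 if h >= 0 else -1
    return state

memory = [1, -1, 1, 1, -1, -1, 1, -1]
noisy  = [1, -1, -1, 1, -1, 1, 1, -1]          # two bits flipped by "noise"
print(recall(store(memory), noisy) == memory)  # True: the cue falls into the basin
```

Like rainfall finding the valley floor, the noisy input settles into the nearest low-energy state, the stored memory.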

Other types of mathematical attractors also exist, like line, plane and cyclic attractors, and these have been used to explain neural functions like eye-position control and cyclic motor control, such as walking and chewing (Eliasmith, 2005).

Common to these attractors is their stability and predictability, which is good with regard to having stable memories and stable bodily control, but what about personality? Is personality also an attractor? Do we all have our basins of attraction, which pull our personality towards stable behavior?

Probably yes, but if you think about it, personality is a more unpredictable property than memory and body control. We think we know someone, and then suddenly they behave in an unexpected manner. Still, the overall personality seems to be more or less stable. How can something be both stable and unpredictable at the same time?

Well, there is another class of attractors that may occur in attractor networks. These are the strange (or chaotic) attractors, and they are exactly that: partly stable and partly unpredictable. We say they are bounded, but non-repeating.

A famous example is the Lorenz attractor, discovered by Edward Lorenz while he was programming his "weather machine", in which typical weather patterns appeared but never repeated themselves. In the figure below the blue curve is pulled towards the red strange attractor, and once it enters the attractor it is bound to follow a certain pattern, though it never repeats itself.



(Lorenz attractor in a network, by Eliasmith / CC BY )

The discovery of strange attractors led to the development of chaos theory and fractal geometry in mathematics. Many phenomena around us may develop smoothly, in linear and predictable fashion, until a certain border is reached, at which point a chaotic state appears before a new order may settle. Just think of water being heated towards its boiling temperature: a chaotic state occurs just as the liquid passes into vapor.

Some scientists now believe that the transition from unconsciousness to consciousness may be a similar transition between states. The mathematical model of consciousness proposed by Tononi (2004) is based on the assumption of the capacity of a network to integrate information. If the level of information integration crosses a certain limit, a new and emergent state is entered, consciousness. This corresponds to a fundamental change in the property of the network as a whole, and I think we can all agree upon the fact that there is a fundamental change between being asleep and being awake. There is no linear transition between the two.

What about personality?

Well, a person's personality is of course most apparent in our most conscious state, where we act in a dependable and responsible manner, although there is also an unconscious side to it, according to Jung, often referred to as the "shadow" of our personality. The shadow side, residing in our unconscious, is according to Jung balancing our conscious side; it holds the key to relaxation, serendipity and creativity, but also to irrational behavior in stressed situations. This was also recognized by Walter Lowen (1982), a systems-science researcher on artificial intelligence who developed a model of personality that reached far beyond the simple dichotomies of Carl Jung.

To many, this may just sound like psychological thinking, but if we combine the theory of strange attractors from chaos theory with the model of consciousness by Tononi, I think we may find theoretical support to the thoughts of Jung.

According to Tononi, consciousness depends on both information integration and information segregation. Loosely speaking, consciousness is generated by a “central” network complex with a high capacity for information integration, whereas other, but connected sub-networks containing specific and segregated information, may contribute without actually being part of the central network complex.

A good example is the Cerebellum, which contains more neurons than the entire remainder of the brain (the Cerebrum). Still, the activity of the Cerebellum is totally unconscious to us. The Cerebellum contains segregated information and procedures, like books and instruction manuals in a library. This information may however be integrated by other parts of the brain, combined with input signals from our senses into a conscious experience.

The main point here is that certain parts of the brain, and certain circuits, are more involved than others in what we can call the conscious complex of the brain. The other parts are connected and contribute, but work quietly in the “shadow”.

The way information is integrated is still unknown, but it may very well be in the form of strange attractors, cycling through regions of the brain in somewhat unpredictable, yet bounded manners. Some attractors work within the conscious complex, others work in connected, but unconscious parts.

I believe that personality depends on the properties of these strange attractors, and how these attractors are distributed, either within the central conscious complex, or in the more peripheral unconscious network space!

For instance, for the personalities Jung characterized as "Feeling" (correlated with being "Agreeable" in Big Five theory), the ventral circuits involving sub-cortical brain parts like the amygdala may be part of a strange attractor within the conscious complex, whereas "Thinking" (or less "Agreeable") personalities may have more dorsal conscious attractor states.

This can serve as a theoretical explanation to why some people tend to base their decisions more on values and emotions than others who tend to make more impersonal decisions based on logics.

Other personality traits are also possible to explain by this theory.

Personalities are strange attractors: unpredictable, never repeating, but bounded. People have similar types, yet we are all different. In the mathematics of chaos and strange attractors this is known as sensitive dependence on initial conditions. It was this that led Lorenz to discover the chaotic property of his weather machine. Others have referred to it as the butterfly effect: how the minimal disturbance of a butterfly flapping its wings can, through a cascade of causalities, result in a storm on the other side of the world.
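Both properties, boundedness and sensitive dependence, can be seen directly by integrating the Lorenz system with its standard parameters (sigma = 10, rho = 28, beta = 8/3). The simple Euler scheme and step size below are my own choices for illustration.

```python
# The Lorenz system under forward Euler integration: the trajectory stays
# bounded on the attractor, yet a butterfly-sized change in the starting
# point soon produces a completely different path.

def lorenz_trajectory(x, y, z, steps=5000, dt=0.005):
    path = []
    for _ in range(steps):
        dx = 10.0 * (y - x)          # sigma * (y - x)
        dy = x * (28.0 - z) - y      # x * (rho - z) - y
        dz = x * y - (8.0 / 3.0) * z # x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        path.append((x, y, z))
    return path

a = lorenz_trajectory(1.0, 1.0, 1.0)
b = lorenz_trajectory(1.0, 1.0, 1.000001)   # a one-millionth perturbation
bounded  = max(abs(x) for x, _, _ in a) < 100
diverged = max(abs(xa - xb)
               for (xa, _, _), (xb, _, _) in zip(a[-1000:], b[-1000:])) > 1.0
print(bounded, diverged)  # True True
```

Both runs remain inside the same basin of attraction, similar but never equal, which is exactly the stability-with-unpredictability argued for personality above.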

We are all born with different initial conditions, different families, different environments, different experiences. We may be born with the same set of strange attractors, but we will never be equal, only similar, since we are bounded within the same basins of attraction.

It is this individual unpredictability that makes it so hard to understand the human mind. The only thing we can do is to use statistical models like the Big Five (factor analysis) or Jung/Lowen (dichotomous classification) to try to separate the stable properties of the attractors from the unpredictable chaotic variation among individuals. However, new brain scanning technologies now provide unprecedented possibilities to go beyond Carl Jung and to study the strange attractors of the brain by multivariate statistical meta-analysis in order to get a better grip on what makes us similar, yet different.

I think Carl Jung was not far off!




Rolls E.T and Deco G. (2010) The noisy brain. Oxford University Press Inc, New York.

Van Eenwyk, J. R (1997) Archetypes & Strange Attractors – The Chaotic World of Symbols. Inner City Books, Toronto, Canada.

Tononi, G. (2004) An information integration theory of consciousness. BMC Neuroscience, 5:42.

Eliasmith, C. (2005) A Unified Approach to Building and Controlling Spiking Attractor Networks. Neural Computation, 17(6), 1276-1314.

Lowen, W. (1982) The dichotomies of the mind. Wiley and Sons, New York.




The statistics of free will Fri, 29 Jan 2016 14:48:19 +0000

The post The statistics of free will appeared first on Metacognition.

“The one to the right”, I responded.

My wife was asking me which curtains I would prefer to buy for our living room. I must admit it, I did not have much preference for either, both were nice and my answer was quite random. Maybe one of them gave me some triggering association guiding my choice, or the decision was the random act of a single neuron firing in my brain. Anyway, I usually give little thought to curtains.

I have in a previous post, Be creative – Use your noisy brain, written about how decision making, alertness and creativity in fact depend on a certain randomness in our brain signals. Our neurons constantly send signals to other neurons at a certain rate, even in a resting state, and it turns out that this random, noisy firing of the neurons has many positive effects on us (in addition to some less beneficial ones); for instance, it may help men choose curtains…

Does this mean that I lack free will? Are all my decisions just results of random neuronal firings?

What about the input signals from my senses, leading to my perceptions of the world: are they random or deterministic? In another post, Your random attention, I wrote about how our unconscious selves filter and alter our perceptions of the world in a bottom-up manner, turning seemingly random input signals into pre-processed data, and how our personalities, emotions, inherent values and interests change the way we perceive the world.

Raw input signals from the senses are sent from the sensory cortical areas towards the prefrontal cortex (where our working memory is), but contextual information from the hippocampus and the basal ganglia, and emotional information from the amygdala, changes the way the sensory data are perceived.

This data pre-processing step is hard-wired in the brain in neural networks with synaptic connection strengths based on learning from previous experiences and decisions. From all incoming and pre-processed data we sample, at any given moment, a point of attention, as described in Your random attention.

So there certainly is a level of randomness to what we perceive and become consciously aware of, and so far I have presented little evidence of free will, because even our attention point is randomly sampled.

Do we have free will?

This question has troubled philosophers for centuries, and even now, in a time when neuroscience and neuropsychology make astonishing discoveries about neuronal functioning, philosophers of science do not quite agree.

Is everything random? Or, on the contrary, is it all fate?

According to the deterministic reductionists, the world is deterministic down to the interaction of elementary particles, and there is no other causality than that found at this fundamental level. Tse (2013), who holds a different view himself, states on behalf of the deterministic viewpoint that:

“If determinism is true, we have only one possible course of action at any moment, so we would not have acted otherwise and have no alternatives to choose from.”

Hence, a world without random events is incompatible with the notion of free will. What about the other extreme? Is everything random?

The advocates of pure indeterminism claim that free will is non-existent due to a purely stochastic world. It sounds logical: if everything were random, all input signals to our senses would be random, all signal transmissions between our neurons would be random, and, hence, so would all our decisions be!

This view gained strength with the publication of results indicating that our unconscious makes our decisions for us even before we become consciously aware of them (Soon et al., 2008). Are we fooled into believing that we are in conscious control of our actions, while the unconscious makes the decision for us, perhaps on a purely random basis?

So, where does this leave us? Neither pure determinism nor pure indeterminism allows free will to exist. Are we in conscious control of our decisions? Even philosophers like David Hume did not see any solution to this problem.

Is there anything in between pure determinism and pure randomness which allows free will?

Yes, there is statistics…

Free will is “top-down” statistical categorization of future “bottom-up” random input signals. That is, there are sequentially updated classification models which map random perceptions into consciously pre-determined categories, like types of decisions and behavior. Tse (2013) refers to this as criterial causation in his highly recommended book “The neural basis of free will”.

To be more specific, decision networks assign input signals (perceptions) to categories according to preset criteria (coded by synaptic weights). These criteria have previously been consciously defined in a top-down manner and are modifiable for the future.

For example, if the perceived input patterns meet certain criteria, we have decided to act in a certain way. This is free will! We consciously choose the criteria which future input signals must meet for the various decision outcomes.
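To make the idea concrete, here is a loose sketch in Python of criterial causation as described above: preset criteria (weights) map a noisy perception to a decision category, and the criteria themselves can be updated for future inputs. The two-feature setup and the update rule are my own illustrative assumptions, not Tse's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decision network: the preset criteria are coded as weights,
# one row per decision category (both rows are made-up examples).
criteria = np.array([[1.0, 0.0],   # category 0: respond to feature 1
                     [0.0, 1.0]])  # category 1: respond to feature 2

def decide(perception, criteria):
    # Assign the (possibly random) input to the category whose
    # criterion it matches best.
    return int(np.argmax(criteria @ perception))

perception = rng.normal(size=2)      # noisy bottom-up input
choice = decide(perception, criteria)

# "Free will" in this picture: the criteria themselves are consciously
# modifiable for future inputs (here, a simple illustrative update).
criteria[choice] += 0.1 * perception
```

The random perception decides nothing by itself; it is the consciously chosen criteria that determine which category it falls into.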

Recent research (Schultze-Kraft et al., 2015) indicates that we may change our criteria for decisions down to 200 milliseconds before the movement. This more or less contradicts the previous experiments indicating that our unconscious decides seconds before we become consciously aware of our decision, which had reinforced doubts about free will.

However, the time interval for alteration of the decision criteria is not important for the existence of free will, only the fact that we can change the criteria for future inputs, since this sets the strengths of the connections in the decision network. In a way, we may consider the criteria of the decision network as “accumulated free will”, since they have been constantly modified throughout our lifetime, or for as long as a specific kind of decision has been relevant.

Free will is thus existent as a statistical classification model based on neural networks. It transforms random input into preset decision categories, which we constantly optimize to meet our standards.


“But, I think the gray curtains match better the furniture”, my wife replied.

With this new and relevant information I suddenly had an opinion about curtains, taking either an increased or decreased liking to gray curtains. I updated my curtain categories for future visits to the store, and I made a conscious decision to alter the weights of my statistical model.

We have gray curtains in our living room, by the way…


Tse, P. (2013). The neural basis of free will: Criterial causation. MIT Press.

Soon, C.S., Brass, M., Heinze, H.J. and Haynes, J.D. (2008). Unconscious determinants of free decisions in the human brain. Nature Neuroscience, 11(5), 543-545.

Schultze-Kraft, M., Birman, D., Rusconi, M., Allefeld, C., Görgen, K., Dähne, S., Blankertz, B. and Haynes, J.D. (2015). The point of no return in vetoing self-initiated movements. Proceedings of the National Academy of Sciences, 201513569.



Be creative – use your noisy brain Fri, 18 Dec 2015 14:10:25 +0000 “Noise” The word gives associations, like bad or loud music, or perhaps the sound of rush hour traffic.  Whatever you think of when you hear the word, it is likely […]

The post Be creative – use your noisy brain appeared first on Metacognition.



“Noise”

The word gives associations: bad or loud music, perhaps, or the sound of rush-hour traffic. Whatever you think of when you hear the word, it is likely something you would do very well without. Noise is irrelevant or uninformative signals reaching your senses. You may think that noise is no good.

As a statistician I deal with noise on a daily basis, but another kind of noise. As I wrote in an earlier post, statistics is all about separating signal from noise in observable random variables. Noise is the non-explainable part of the observed variation, whereas the signal, the explainable part, represents information. We may say that noise is the opposite of information.

Just think of temperature, a variable phenomenon. Some of this variation is explainable, and the weather forecast uses this information to predict tomorrow's temperature. However, there is always some variation that cannot be explained. That's statistical noise!

So, there is a certain randomness in noise, and randomness is maybe something you wouldn’t think of when we talk about brains and thoughts. You probably think that you are in control of your own mind?

Think again.

No, your brain is noisy! In fact, it produces its own random noise signals!

Just to make it clear right away: the noise is good for you in many ways. For instance, it can make you more alert and react faster, it can help you make decisions when you really don't know what to do, and… it helps you be creative, if you allow it…

Lean back a moment and think of a beautiful scenery, like the green hills of the Lake District in England or an alpine region of the Himalayas. The landscape is not flat, but you may take a hike and walk from one valley to another. Of course, it takes more effort in the Himalayas than in the Lake District to cross from one valley to another.

Your brain's network is like such a landscape. Your thoughts, potential decisions, and memories are like valleys in an energy landscape: pathways of strongly connected neurons form the low-energy valleys (Hopfield, 1982). Mountains or hills form barriers between your separate memories or thoughts, helping you keep your thoughts straight.

Neurons pass electrical signals to one another, and the signaling across the brain network may be mathematically described by a system of differential equations expressing the signal flow. In mathematical terminology, the low-energy states of memories are attractors, and the hillsides form the basins of attraction. Just as falling rain flows down the hillsides to the bottom of the valley, neuronal signals entering a basin of attraction tend towards the attractor state.

This is what happens as you recall a memory. An association or a seemingly random input signal may create signals within a basin of attraction, and as the signals flow towards the attractor, the memory is recalled and potentially strengthened through so-called recurrent connections between the neurons representing the memory.

So, what about the noise, where does it enter this picture?

As described in the excellent book “The noisy brain” by Rolls and Deco (2010), which inspired me to write this post, the neurons in your brain constantly send spontaneous signals even when they are “resting”. They refer to this as the spontaneous state; for a set of resting neurons the neural firing may be approximated by a Poisson distribution, with an average firing rate of about 10 spikes per second (this may vary between brain regions). Hence, neurons send noisy signals even when they are “off work”!

For neurons in an activated and stable attractor state the firing rate is higher, for instance, 100 spikes per second.

The figure below shows two Poisson distributions with different mean firing rates, one for the spontaneous low-rate state and another for the high-activity attractor state.
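To complement the figure, the two firing regimes can be simulated directly; a minimal sketch using the illustrative rates given above (10 and 100 spikes per second):

```python
import numpy as np

rng = np.random.default_rng(2)

# Spike counts over one second, approximated as Poisson draws.
spontaneous = rng.poisson(lam=10, size=10_000)   # resting state, ~10 spikes/s
attractor   = rng.poisson(lam=100, size=10_000)  # active attractor, ~100 spikes/s

# The attractor state fires roughly ten times as often on average,
# yet both states are noisy: individual counts scatter around the mean.
ratio = attractor.mean() / spontaneous.mean()
```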





Typically, a neuron will generate an electrical pulse (an action potential) to be forwarded along its output connections if, within a limited time interval, it receives sufficient excitatory input from other neurons. This is known as the integrate-and-fire model (Lapicque, 1907). It is a kind of all-or-nothing response: if the sum of input signals is too low, nothing happens, but as soon as the sum crosses a threshold, the electrical pulse is generated.

The signal which is passed on may, for instance, trigger a response in your motor cortex, making you physically react to sudden changes in your surroundings. Maybe you need to make a sudden maneuver to avoid hitting an animal jumping in front of your car! In that moment the spontaneous noise from the resting neurons may have made you more alert, since it tends to bring the sum of signals closer to the firing threshold. Hence, your reaction time was lowered, and you became more alert due to this inherent noise in your brain.
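This alerting effect of noise can be demonstrated with a toy leaky integrate-and-fire neuron. All parameters below (stimulus strength, leak factor, noise level, threshold) are made-up illustrative values: the point is only that a stimulus sitting just below threshold never fires on its own, while background noise pushes the sum over the threshold sooner.

```python
import numpy as np

rng = np.random.default_rng(3)

def time_to_fire(stimulus, noise_sd, threshold=1.0, leak=0.9, steps=200):
    """Leaky integrate-and-fire sketch: sum inputs until threshold."""
    v = 0.0
    for t in range(steps):
        v = leak * v + stimulus + rng.normal(0.0, noise_sd)
        if v >= threshold:
            return t          # fired: the all-or-nothing response
    return steps              # never fired within the window

# Without noise this weak stimulus settles just below threshold and
# never fires; with background noise from "resting" neighbours it does.
quiet = np.mean([time_to_fire(0.09, 0.00) for _ in range(50)])
noisy = np.mean([time_to_fire(0.09, 0.05) for _ in range(50)])
faster_with_noise = noisy < quiet
```

The noisy neuron reacts where the quiet one stays silent, which is the "lowered reaction time" described above.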

When you let your thoughts wander, you move around in the landscape, from one valley to another, like a hiker. One thought serves as an association to another, and the signals move between the attractor states. Occasionally, though, random firing of neurons may distract your thoughts, bringing you from one valley to another. This is more likely to happen if you are hiking in the Lake District, where only low hills separate the valleys, than in the alpine Himalayas with their tall mountain ridges.

This serves as a perfect analogy for what Barbara Oakley (Oakley, 2014) referred to as diffuse mode and focused mode in thinking. In a focused mode the basins of attraction are deep and you are not so easily distracted, because you are really focused on something. On the other hand, in the diffuse mode random and spontaneous firings may easily bring you from one basin to another, and day-dreaming is a typical diffuse mode example.

Creative persons, like the inventor Thomas A. Edison, and many outstanding scientists have deliberately used the diffuse mode to enhance their creativity, because creativity is the ability to make new combinations of old ideas. Mental de-focusing, by resting on a couch, taking a shower, or walking in the park, takes you from the Himalayas to the Lake District, and the random firing in your brain, the noise, is more likely to give you random associations from which new ideas may arise. As I wrote in my previous post “Google pedagogics”, there is evidence that the cerebellum is also highly involved in this creative process, but exactly how is not yet completely understood. It is, however, interesting that the Purkinje cells of the cerebellum show very high spontaneous firing rates, indicating a high noise level.

In statistics the signal-to-noise ratio is a well-known concept. The familiar ANOVA (Analysis of Variance) uses this principle to identify information in the observed variation of a random variable. As is known from ANOVA theory, information is easy to find if the signal-to-noise ratio is high (a large F-test statistic), but difficult to find if the signal is weak compared to the noise (a small F). The signal-to-noise ratio can be directly illustrated by the depth of the valleys: deep valleys and high mountain ridges parallel a high signal-to-noise ratio, and vice versa.
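The valley analogy can be checked numerically. Below is a standard one-way ANOVA F statistic computed from scratch; the two simulated datasets (means and sample sizes are arbitrary illustrative choices) play the roles of "deep valleys" (strong signal, same noise) and "shallow valleys" (weak signal, same noise).

```python
import numpy as np

rng = np.random.default_rng(4)

def f_statistic(*groups):
    """One-way ANOVA F: between-group over within-group mean squares."""
    grand = np.mean(np.concatenate(groups))
    k = len(groups)
    n = sum(len(g) for g in groups)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Deep valleys: group means far apart relative to the noise.
deep = f_statistic(rng.normal(0, 1, 100), rng.normal(3, 1, 100))
# Shallow valleys: a weak signal buried in the same noise level.
shallow = f_statistic(rng.normal(0, 1, 100), rng.normal(0.2, 1, 100))
```

The strong-signal case yields a large F and the weak-signal case a small one, mirroring how easily the "information" stands out from the noise.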

So, we have now seen that a “noisy brain” is not too bad to have! A third example is also described by Rolls and Deco, namely the way spontaneous firing may help you make a decision when you're stuck between two options. The decision networks of your brain can represent possible decisions as attractors as well, and like a ball balancing on the ridge between two equally deep basins, you may feel “dead-locked”, not knowing what to choose. Often the decision is made on the basis of random external input through the senses, but random firings in the spontaneous state of neighboring neurons may also cause the ball to roll down into one of the decision valleys. Hence, random noise may also help you make probabilistic decisions! This may also explain how we sample our attention points from the posterior distribution, as I wrote about in the post “Your random attention”.

These are only some of the positive effects of internal noise in our brains. There are of course other effects of noise, some less positive. The concepts of noise and the signal-to-noise ratio may help explain decreased short-term memory, concentration, and long-term memory formation with aging, the sometimes irrational behavior of stressed persons (the “shadow side” of our personality, as Carl Jung wrote about), mental disorders like schizophrenia, or simply the lack of focus characterizing people with ADHD.

I will end this post with an open question which I will return to and discuss in a later post.

If our brains are non-deterministic, noisy and decisions may come as a result of random firings of neurons…

…do we have free will?



Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8), 2554-2558.

Lapicque, L. (1907). Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarisation. J. Physiol. Pathol. Gen., 9(1), 620-635.

Oakley, B. (2014). A Mind for Numbers: How to Excel at Math and Science (Even If You Flunked Algebra). Penguin.

Rolls, E. T., & Deco, G. (2010). The Noisy Brain: Stochastic Dynamics as a Principle of Brain Function. Oxford Univ. Press, UK.


Polanyi and the neural networks Wed, 11 Nov 2015 10:10:24 +0000 I was recently introduced to the thoughts of the science philosopher and scientist Michael Polanyi by Dr. Helge Brovold. As I was reading Polanyi’s book “The tacit dimension” (1966), which I borrowed from […]

The post Polanyi and the neural networks appeared first on Metacognition.

I was recently introduced to the thoughts of the philosopher of science and scientist Michael Polanyi by Dr. Helge Brovold. As I was reading Polanyi's book “The tacit dimension” (1966), which I borrowed from Helge, I realized that Polanyi's description of tacit knowledge has a statistical counterpart, which I write about here. As it turns out, this statistical detour may teach us how tacit knowledge is implemented in our brains. Read on if this sounds interesting!

The human brain is an incredible pattern recognizer. Just think about it: among the hundreds or thousands of faces you see every day, most of which just pass by unnoticed, you recognize a familiar face within a fraction of a second. But do you really know how you know a face? Can you describe the face of an acquaintance without seeing him or her? Probably not; you just know them when you see them. Michael Polanyi refers to this ability as “tacit” knowledge. Sometimes you just know, without knowing how you know!

Polanyi uses the blind man with his stick as an example. After long practice the stick starts to function as an extension of his hand, and he can almost “feel” the tip of the stick as he walks down the street. He does not know exactly how he can feel the tip so vividly; he just does. He has tacit “bodily” knowledge that he cannot express in words.

I have another example of more abstract tacit knowledge from my personal life. In addition to being a statistician, I am a hobby ornithologist. Since I was a small boy I have enjoyed bird watching, and after about 35 years of experience, I have acquired a certain level of tacit knowledge with regard to bird identification. Sometimes just a short glance is enough to identify the bird, but if you ask me how I recognized it, I may be unable to point to the specific feature that gave it away. Among bird watchers this is well known; they often refer to it as “jizz”.

Polanyi describes the tacit dimension as a relation between something which is close to us (he calls it the proximal part) and something which is farther from us (the distal part). We use the proximal part as an extension of ourselves to understand or predict the distal part.

By integrating and internalizing the proximal part, we become able to use it as a tool for pointing our attention towards the distal part.

However, the rules, or the model if you like, which is formed by the integration of the proximal part may be unknown to us! The mind uses the observable distal part as a supervisor in the integration process to form an unconscious model for attention.

So, where is the statistical connection in this story?

The aim of statistical inference is often to build a model connecting a set of easily observable input variables (X) to one (or several) desired output variable(s) (Y). The purpose is typically to use the estimated model for prediction. Just think of the weather forecast: given today's air pressure, temperature, and wind (X), what is tomorrow's weather (Y) going to be? Once found, the estimated model may be used to map newly observed inputs into predicted outputs. The output variable, often referred to as the response, is the quantity at which we point our attention. We may therefore consider the response variable as the distal part in Polanyi's tacit relation, and the input variables (X) as the proximal part.

My interpretation of Polanyi is that the distal part (be it a problem to be solved, a meaning to be understood, or a context to be explored) serves as the target guiding the integration process of the proximal part. Exactly how the integration is done is not important, and the final “model” may lie as a tacit construct in our sub-consciousness. The important thing is that the model serves as a good predictor for the target!

Now, this equals what we statisticians call supervised statistical learning. It is supervised because the target is known, at least during learning. Polanyi's integration process parallels the estimation of some model f(X) connecting the proximal part to the distal part, as in

Y = f(X) + e

(The term e is the noise term, summing up the variation in the response which cannot be explained by the model.)
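The whole supervised pipeline fits in a few lines. Below is a minimal sketch assuming a linear f and two made-up input variables standing in for, say, pressure and temperature; the coefficients and noise level are illustrative, not from any real weather data.

```python
import numpy as np

rng = np.random.default_rng(5)

# Observe proximal inputs X and a distal target Y...
X = rng.normal(size=(200, 2))                       # e.g. pressure, temperature
true_coef = np.array([2.0, -1.0])
Y = X @ true_coef + rng.normal(0.0, 0.5, size=200)  # Y = f(X) + e

# ...estimate f from the data (Polanyi's "integration" step)...
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)

# ...then point the estimated model at a new, unseen input.
Y_new_pred = np.array([1.0, 1.0]) @ coef
```

The fitted coefficients recover the true ones up to the noise, and the model can then predict the distal target for tomorrow's inputs.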

Some may now object that the integrated proximal part in statistical learning, the estimated model, is not tacit, because the model is chosen subjectively by the statistician. The statistician thus has detailed information about the model and knows perfectly how the input is connected to the response. That does not sound very tacit!

No, but there is a special class of algorithmic statistical models in supervised learning which does fit into the tacit dimension, and, in fact, the models are inspired by our brain network. They are known as artificial neural networks (ANN).

A neural net with input and output “neurons” connected by two layers of hidden “neurons”. The hidden layer corresponds to the proximal part of Polanyi’s tacit dimension.


The model f(X) of an ANN is not completely defined by the data analyst. As shown in the figure, the algorithm assumes one or multiple layers of “neurons” connecting the input variables X with the response Y. One or several hidden layers of neurons are connected to X and Y, and each connection is a linear or non-linear mapping of information with parameters trained from data in such a way that the distal part Y is approximated as closely as possible. The first hidden layer may, for instance, summarize simple patterns observed in X, which are further categorized in the second layer, and so on. In this way an ANN may be used to model highly non-linear and complex output features.

The training of the neural connections is to a large extent left to the computer, and the resulting models may be quite complex. ANN models have been criticized for being too much of a black box, difficult to understand… Exactly! What goes on in the model is hidden; it is tacit!
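The black-box character shows up even in a deliberately simplified network. In the sketch below the hidden weights are fixed at random and only the output connections are fitted (a simplification; real ANNs train all layers), yet the point stands: the fitted weights predict the target well while saying nothing readable about how input relates to output.

```python
import numpy as np

rng = np.random.default_rng(6)

# A toy network with one hidden layer, approximating a nonlinear target.
X = np.linspace(-3, 3, 200).reshape(-1, 1)
Y = np.sin(X).ravel()                          # the distal target

W_hidden = rng.normal(size=(1, 50))            # input -> hidden (fixed, random)
b_hidden = rng.normal(size=50)
H = np.tanh(X @ W_hidden + b_hidden)           # hidden "neurons"

w_out, *_ = np.linalg.lstsq(H, Y, rcond=None)  # fit only the output connections
Y_hat = H @ w_out

# The fit is excellent, yet inspecting W_hidden and w_out reveals nothing
# interpretable about *how* X maps to Y: the model is tacit.
mse = float(np.mean((Y - Y_hat) ** 2))
```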

The tacit dimension of our brain is probably quite similar. After all, artificial neural networks are inspired by biological neural networks… Our neural networks may be trained to connect a set of perceived inputs to a target output in a way which is hidden to us and buried in our unconsciousness. Beneath our consciousness our brains perform multivariate, statistical AND tacit modeling, and recent research has indicated that the cerebellum may play a key role in the unconscious integration of inputs and facts, as I wrote about in a previous post, “Google pedagogics”.

Artificial neural networks are, as mentioned, highly flexible, and there is always a risk of over-fitting: the model fits the observed outputs of the training data too well and ends up explaining noise in the response. This reduces the predictive performance of the model for new observations. To reduce this problem it is customary to “prune” the network by removing some “weak” connections, making it less complex and less vulnerable to random fluctuations in the response. As I wrote in my previous post, new theories about the effect of sleep suggest that natural pruning goes on in our brain during the NREM phase of sleep. Hence, it appears that our brain also fine-tunes our tacit knowledge during sleep, making it more efficient for later use.
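The simplest form of pruning is magnitude-based: zero out weights whose absolute value falls below a threshold, keeping only the strong connections (akin to variable selection in statistics). A minimal sketch, with a made-up weight matrix and an arbitrary threshold:

```python
import numpy as np

rng = np.random.default_rng(7)

# A hypothetical layer of trained connection weights.
weights = rng.normal(0.0, 1.0, size=(10, 10))
threshold = 0.5

# Prune the "weak" connections: set small-magnitude weights to zero.
pruned = np.where(np.abs(weights) < threshold, 0.0, weights)
kept_fraction = float(np.mean(pruned != 0.0))
```

A sizeable fraction of the connections disappears, leaving a simpler network that is less prone to chasing random fluctuations.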

So, to sum it all up, our brain appears to have the ability to use its biological neural network to integrate proximal information into a flexible model which can be used to understand or predict a distal target or response, and sometimes this is a completely unconscious and tacit process.

All our brain lets us know… is that we know…!


Polanyi, M. (1966). The Tacit Dimension. Doubleday, New York.



Your random attention Tue, 06 Oct 2015 06:19:58 +0000 Are you good at multitasking? Some self-confident people may believe they are skilled multi-taskers. It is also a common saying that women are better at multi-tasking than men. They are […]

The post Your random attention appeared first on Metacognition.

Are you good at multitasking?

Some self-confident people may believe they are skilled multi-taskers. It is also a common saying that women are better at multi-tasking than men.

They are all wrong…

Research shows that no one is truly a multi-tasker in the sense that he or she can attend to two or more things simultaneously. Attention studies show that we are all serial-taskers: we can only attend to one thing at a time! Those who appear to be multi-taskers instead have a higher capability of switching rapidly between different focal points, whereas the rest of us (myself included) change attention at a slower pace. Attention is therefore a continuous process which jumps from one focal point to another at a shifting pace.

When you are conscious you always attend to something. Maybe you are eavesdropping on the conversation at the neighboring table during lunch because you think you heard someone mention your name, or maybe you are fully focused on finding your way as you drive through an unfamiliar town. It may also be that your attention is pointed more inward as you concentrate on solving a mathematical problem that has troubled you for days.

In any case you have, for some reason, chosen your focal point. However, theoretically there are infinitely many other things you might have paid attention to, but for some reason you made your choice.

So, is attention merely a random process?  Yes…

Well, fortunately, it is not as unpredictable as the outcome of flipping a coin or rolling a die, but attention appears to be a random sampling process, or what statisticians call a stochastic process, where some focal points are more likely to be chosen than others. For instance, if you are driving through an unfamiliar area, you are (hopefully) likely to pay close attention to the traffic and to signs along the road, and less likely to notice a bird flying by (well, unless you are an ornithologist, that is). But at any split second you have to choose one thing to attend to, and one thing only, and like drawing a ball out of an urn, you sample your attention point.

Attention is therefore like the outcome of statistical sampling from a probability distribution. We may call it the attention distribution. Each potential target of your attention may be selected with a certain probability.

The next question is: Which factors influence this probability distribution for the potential targets of our attention?


Every second you are bombarded by billions of bits of internal and external information from your senses, and your brain performs a tremendous amount of work filtering the incoming data. You should be happy about this! It would be quite impossible, and very disturbing, for your conscious self to handle all this information.

The competitive networks (Rolls and Deco, 2010) in the cortical regions receiving input stimuli, e.g. the visual cortex, perform statistical classification, categorization, and data reduction long before you are aware of it. A new study (Brascamp et al., 2015) even shows how the visual cortex makes unconscious interpretative decisions before the transformed data are sent to association networks in your (conscious) frontal cortex. On its way the information also passes sub-cortical regions, where it is further modified by contextual memories of similar previous experiences and by your motivations, values, and interests.

This is an unconscious process, but the data filtering is influenced by your brain physiology and your personality. Your brain has learned over the years of your life what you find relevant, and irrelevant information is likely filtered out long before it reaches your consciousness.

A further modification of the attention distribution is executed in regions of the ventromedial cortex and the orbitofrontal cortex. These brain regions constantly monitor your actions and compute their potential values (Dehaene, 2014). Again, this is an unconscious process, and you may now start to wonder whether your conscious self has any vote in the election of your attention, but in fact the unconscious part of you is far better at multivariate statistical computations than your conscious self, because your working memory can only handle a few chunks of information at a time.

However, you are of course not a slave to your unconsciousness. If you have consciously decided to focus on something, like solving a difficult mathematical problem, your prefrontal cortex sends signals to central brain regions to inhibit undesired disturbances, and this makes a final modification to the sampling distribution of your attention. Depending on your ability to focus, this may squeeze most of the probability mass onto a few potential attention points, making you almost oblivious to your surroundings.

In summary, your brain prepares, both unconsciously and consciously, a range of potential input signals you can attend to, and assigns a probability to each attention point reflecting its relevance and potential value to you.

THEN, you draw your statistical sample. At that given moment you sample ONE attention point, and thanks to your statistical brain, you have most likely chosen something relevant for you in the given context.

The information-filtering property of the brain and the relevance/value assignment is a key reason why it is so difficult for teachers to keep the attention of their students over a long period of time. If a student's unconsciousness deems the incoming data irrelevant, it is filtered out and attention is pointed somewhere else. A good teacher should therefore present the topic in a way the student finds relevant. But the problem is, of course, that in a classroom with 25 students there are also 25 different definitions of “relevant”.

I mentioned earlier the effect of strong focus on the probability distribution of attention. A strong focus will through inhibition of distractions put most of the probability mass on a few potential attention points. The attention process will then move around in a rather limited output space. Oakley (2014) refers to this as the focused mode. This mode helps us put all efforts into solving difficult problems.

The opposite and less focused mode, the diffuse mode, corresponds to a wider probability distribution for attention. In this case, the distribution is mainly a product of the unconscious, and attention and thoughts may wander more freely. Many enter the diffuse mode for instance when they are walking, taking a shower, listening to music or having a nap.

Many have experienced the potential benefit of the diffuse mode in problem solving. Sometimes the focused mode puts zero probability on the attention points which would provide the solution, and the attention process has no chance of reaching it. You are, in fact, focusing too hard! Defocusing may then flatten the attention distribution and increase the probability of stumbling upon the solution! The positive effect of defocusing (for instance by taking a stroll, a nap, or in some other way disconnecting from the problem) also appears to be related to the cerebellum, which, according to recent discoveries, has the ability to combine stored memories and input signals in new and creative ways (Saggar et al., 2015).
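The focused/diffuse contrast can be sketched as a single knob on the attention distribution. Below, focus acts like an inverse "temperature" on made-up relevance values: strong focus squeezes the probability mass onto a few attention points, while the diffuse mode flattens the distribution. The softmax-style weighting is my own illustrative choice, not a claim about how the brain actually computes these probabilities.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical relevance values for five potential attention points.
relevance = np.array([3.0, 2.5, 1.0, 0.5, 0.2])

def attention_distribution(relevance, focus):
    # Higher focus -> sharper distribution; lower focus -> flatter.
    p = np.exp(focus * relevance)
    return p / p.sum()

focused = attention_distribution(relevance, focus=5.0)   # focused mode
diffuse = attention_distribution(relevance, focus=0.5)   # diffuse mode

# At any moment you draw ONE attention point from the distribution.
point = rng.choice(len(relevance), p=diffuse)
```

In the focused distribution nearly all the mass sits on the top point; in the diffuse one, even the low-relevance points (where a solution might be hiding) get a real chance of being sampled.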

So, if you are a student, stuck with a seemingly unsolvable problem at an exam, try to defocus a bit to widen your horizon. It may help your brain to sample the correct attention point!



Rolls, ET and Deco, G. (2010). The Noisy Brain: Stochastic Dynamics as a Principle of Brain Function. Oxford Univ. Press, UK

Brascamp J. et al, (2015). Negligible fronto-parietal BOLD activity accompanying unreportable switches in bistable perception. Nature Neuroscience.

Dehaene, S. (2014). Consciousness and the Brain: Deciphering How the Brain Codes Our Thoughts. Penguin.

Oakley, B. (2014). A Mind for Numbers. Tarcher, Los Angeles, CA.

Saggar, M. et al. (2015). Pictionary-based fMRI paradigm to study the neural correlates of spontaneous improvisation and figural creativity. Scientific Reports, 5.

The post Your random attention appeared first on Metacognition.

Sleep improves learning by statistical shrinkage and variable selection! Wed, 09 Sep 2015 06:01:47 +0000

The post Sleep improves learning by statistical shrinkage and variable selection! appeared first on Metacognition.

Scientists have always been curious about why we feel tired and must sleep after a long day. New theories in neuroscience may explain why, and interestingly enough, what happens during sleep bears resemblance to statistical methods like shrinkage estimation and variable selection. The theories also make it clear why sleep is so important for effective learning and long-term memory.

When we are awake and conscious we are constantly processing input information, and we are in a state of learning much of the time. New thoughts and memories are formed by so-called long-term potentiation (LTP), a process whereby the strengths of the connections (synapses) between the neurons in the brain are increased. Some connections are also weakened by an opposite process called long-term depression (LTD), if the connections turn out to be less important.

Due to random spiking of neurons, "errors" may also be induced by LTP of unimportant synapses and LTD of important ones. (The random spiking of neurons is an important property of the brain network which I will return to in a later post.)

A memory or a skill that we learn is stored as a so-called “engram”, a strengthened pathway between neurons in the neuronal network which is formed by LTP and LTD. You may think of the engrams as valleys in an energy landscape with mountains formed by LTD and the valleys by LTP. Recalling a memory activates signals along the pathway like a ball rolling through the valleys of the energy landscape.

The formation of memories may be described mathematically in terms of recurrent associative networks (e.g. Rolls and Deco, 2010). According to these models, learning follows the Hebb rule (Hebb, 1949): input signals from numerous neurons to a single neuron are multiplied by their respective synaptic weights and summed. If the sum of weighted inputs is sufficiently large, the receiver neuron will pass on a signal, and the contributing synapses are then strengthened through LTP. This is known as the "integrate and fire" model.
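The integrate-and-fire rule with a Hebbian update can be sketched in a few lines (a toy illustration with made-up numbers, not a biophysical model):

```python
def neuron_step(inputs, weights, threshold=1.0, lr=0.1):
    """One integrate-and-fire step: sum the weighted inputs, fire if the
    sum reaches the threshold, and -- the Hebb rule -- strengthen (LTP)
    the synapses of the inputs that were active when the neuron fired."""
    activation = sum(x * w for x, w in zip(inputs, weights))
    fired = activation >= threshold
    if fired:
        weights = [w + lr * x for x, w in zip(inputs, weights)]
    return fired, weights

# Two active inputs push the weighted sum to 0.6 + 0.5 = 1.1 >= 1.0,
# so the neuron fires and the two active synapses are strengthened.
fired, new_weights = neuron_step([1.0, 0.0, 1.0], [0.6, 0.2, 0.5])
```

Note that the inactive synapse (the middle one) is left untouched: only the connections that contributed to the firing are reinforced.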

Let’s leave the neural technicalities there for now, and go on to what this has to do with sleeping, statistics and learning.


Well, according to a relatively new theory by Tononi and Cirelli (2006), the reason we feel tired is that the conscious learning period (daytime) results in a net increase in synaptic strengths in our brain. This has an exhausting effect because stronger synapses mean more energy consumption.

Our brain accounts for approximately 20% of our energy demand, but this share is not constant. The energy demand is at its lowest right after sleep and may increase by up to 18% during the waking period. This makes us tired, and the potential for further learning by LTP is reduced.

So, what happens during the night to turn the energy level down? According to Tononi and Cirelli, the explanation may be an overall and uniform LTD (weakening) of all synaptic strengths in the brain. In statistical terms we call this scaling. By dividing all synaptic weights by some constant, the average synaptic weight is normalized to a base level before we wake up, ready for new inputs and learning.

Exactly how this scaling is carried out in the brain during sleep is still under investigation, but one theory is that the slow-wave activity observed during the NREM phase of sleep induces a massive LTD proportional to the synaptic strengths. This means that strong synapses remain relatively strong and weak synapses remain weak. The weakest connections may also be removed altogether. Hence, long-term memories and rehearsed skills are not erased, but rather reinforced during sleep by the removal of irrelevant connections.

The way irrelevant connections are weakened and potentially removed during sleep is very similar to what are known as shrinkage methods in statistical inference. The well-known Ridge estimator in multiple regression, which performs a kind of scaling of the regression coefficients, is an obvious example. Shrinking estimates towards zero often reduces variance and improves prediction. We may say that the shrinkage improves the signal-to-noise ratio, which appears to have the same effect as taking a nap has for our brain connections.

Shrinkage by scaling is not the only option. So-called soft-shrinkage, which is part of methods like the LASSO (Tibshirani, 1996) and ST-PLS (Sæbø et al., 2008), is another. An extra benefit of this approach is the possibility of removing the influence of some variables altogether by forcing their effects to zero, thereby performing variable selection. Scaling, on the other hand, will never remove a variable completely.
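The difference between the two styles of shrinkage is easy to see in code (a minimal sketch with made-up synaptic weights):

```python
import math

def scale_shrink(weights, factor):
    """Uniform multiplicative scaling (Ridge-like): every weight shrinks,
    but none is set exactly to zero."""
    return [w * factor for w in weights]

def soft_shrink(weights, t):
    """Soft-thresholding (LASSO-like): move every weight t closer to zero
    and clip the small ones at exactly zero -- i.e. variable selection."""
    return [math.copysign(max(abs(w) - t, 0.0), w) for w in weights]

synapses = [2.0, 0.9, 0.3, -0.1]
scaled = scale_shrink(synapses, 0.5)  # all connections survive, just weaker
pruned = soft_shrink(synapses, 0.2)   # the weakest connection is cut to zero
```

Under scaling, the weakest synapse is merely halved; under soft-shrinkage it is removed entirely, which is the property that would explain complete pruning of irrelevant connections.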

The mechanisms behind the overall LTD during sleep are currently unknown, but a soft-shrinkage property could explain the complete removal of irrelevant connections. It will be interesting to follow the research on this point in the future.

This theory supports the findings that rest improves long-term memory, and it gives a clear message to us all that getting enough sleep is important for learning. Not only does the increased energy demand during the waking period reduce the learning potential, but sleep also consolidates the already stored engrams.

Tononi and Cirelli state that sleep may in fact be the price we pay for our high level of consciousness and learning potential through brain plasticity.

However, theirs is not the only theory explaining why sleep consolidates memory. Others suggest that the dreaming phase during REM sleep also has a function. Replaying thoughts and experiences during dreaming may increase the synaptic strengths of the engrams and improve long-term memory. Further, the rather bizarre sequences of associations that may occur during sleep may have a creative side effect. Many have experienced that their dreams give creative input to their daily activity, and the great inventor Edison even used this actively in his work! The story also goes that the Russian chemist Mendeleev, who worked hard on putting together the periodic table of the elements, got the final pieces to fit together in his sleep.

Sleeping is therefore very important for learning and creativity, and even a short nap during the day can have an invigorating effect on learning.
Maybe a “power nap” after lectures should be compulsory for all students?



Tononi, G. and Cirelli, C. (2006). Sleep function and synaptic homeostasis. Sleep Medicine Reviews, 10, 49-62.

Hebb, D.O. (1949). The organization of behaviour: A neuropsychological theory. Wiley, New York.

Rolls, E.T. and Deco, G. (2010). The noisy brain. Oxford University Press, Oxford.

Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. J. R. Statist. Soc. B, 58 (1), 267-288.

Sæbø, S. et al. (2008). ST-PLS: a multi-directional nearest shrunken centroid type classifier via PLS. J. Chemometrics, 22 (1), 54-62.


“Google pedagogics” Tue, 21 Jul 2015 13:03:49 +0000

The post “Google pedagogics” appeared first on Metacognition.

“Google-pedagogics” may be one of the reasons why children in western countries score low on mathematical tests like the PISA test.

By the term "Google pedagogics" I mean the relatively superficial way of teaching that has become more widespread in schools in recent years. The perception that facts and rules need not be learned by heart, since effective search engines on the internet provide the answer in milliseconds whenever needed, appears to have become more common. However, this entails surface learning, which for subjects like mathematics reduces progress and understanding.

Fortunately, there has been an increased focus on in-depth learning lately, which certainly is the way forward, and here I will give some scientific arguments for why.

We have different memory systems in our brain, and two main categories are working memory and long-term memory. In working memory we may temporarily hold 3-4 chunks of information "warm" for further combination and manipulation until they are no longer needed, after which they are cleared out. For instance, we use this memory to remember an address or a phone number until we have safely written it down.

Also, things we look up on the internet tend to be such chunks of information, which are thrown out after use, and the next time you need the same information, it must be searched for again… Hence, even though the actual internet search for a given fact is extremely quick, the process of learning and working becomes slow!

The things that really stick in our minds are stored in long-term memory, which may be facts and concepts (semantic memory), events (episodic memory) and motor skills (procedural memory). Apart from episodic memory, like your first kiss, or other emotional episodes, the long-term memories require stages of consolidation and repetition in order to be stored properly.

The fact that repetition improves long-term memory is not a new scientific discovery, and it is something most people know from their own experience, but I will here argue that long-term memorization of facts and rules, for instance in mathematics, is very important for understanding, in-depth learning and creative thinking!

Let’s first look at how we master skills like riding a bike or driving a car, which are examples of unconscious long-term memory. I think no one will argue against the importance of practice in order to learn these skills.


In the early stages of learning a new skill, our brain performs highly conscious activity in the frontal part of the brain, the so-called prefrontal cortex. This is the brain region where we make conscious judgments, deductions, computations and planning. At this stage we try to figure out “what is the secret behind this?”, “how should I do this?”, or put in a more statistical way: “what is the general pattern which may be generalized?”. This process is slow and energetically demanding, but totally necessary.

This is a direct parallel to what we in statistics call model estimation. We collect observable "data" and try to extract the invariant properties, the "truth" behind the variable input. In statistics there is also a branch known as Bayesian statistics, which is characterized by the way prior knowledge is modified by newly collected observations to yield improved and more certain knowledge. This is exactly how our brain works when learning a new skill. With each repetition, the "model" for the skill, which is executed by the motor center of the brain, is improved based on the outcome of new attempts. In other words: we become more and more skilled!
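A toy Bayesian sketch of this (my own illustration with hypothetical numbers): treat the probability of executing the skill correctly as unknown, and let each practice attempt update a conjugate Beta prior over it:

```python
def update_skill(alpha, beta, success):
    """Conjugate Beta-Bernoulli update: the prior belief about the skill
    level is combined with one new observation to give a sharper posterior."""
    return (alpha + 1, beta) if success else (alpha, beta + 1)

alpha, beta = 1, 1                               # flat prior: skill unknown
for outcome in [False, True, True, True, True]:  # five practice attempts
    alpha, beta = update_skill(alpha, beta, outcome)

estimate = alpha / (alpha + beta)  # posterior mean: (1 + 4) / (2 + 5) = 5/7
```

Each repetition narrows the posterior, exactly in the spirit of "previous knowledge modified by new observations": after four successes in five attempts, the estimated skill level has risen from 1/2 to 5/7.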

I’m quite sure that whenever you ride a bike, you don’t pay much attention anymore to how it is done. You have it “in your blood”, or, as it turns out from a neuroscientific perspective, you have it “in your cerebellum”…

In the learning phase of riding the bike, there was another thing going on in your brain. The "model" you applied for biking was copied by your cerebellum, the small brain region at the back of your head, and at a certain point in your training you started to apply this stored model instead. The cerebellum took care of the planning of your action based on this model and adapted it to the current situation. Neuroscientists call this a "forward model" for the skill, whereas statisticians would call it a "prediction model". The use of this model is an effective, quick, automatic AND unconscious procedure. At this stage, you had learned how to bike!
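In statistical terms, the learning phase fits a prediction model from practice data, and the mastered phase simply applies it. A minimal sketch (hypothetical data and a deliberately simple linear model, nothing like a real motor controller):

```python
def fit_forward_model(observations, lr=0.5, epochs=200):
    """Fit the prediction model y = a * x by stochastic gradient descent
    on observed (action, outcome) pairs -- the slow, effortful phase."""
    a = 0.0
    for _ in range(epochs):
        for x, y in observations:
            a += lr * (y - a * x) * x
    return a

# Practice data: each action x happened to produce the outcome y = 2 * x.
data = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0)]
a = fit_forward_model(data)

# Mastered phase: predicting the outcome of a new action is now a fast,
# automatic evaluation instead of a slow conscious deduction.
predicted = a * 0.8
```

The expensive part (fitting) happens once, during practice; afterwards every prediction is a cheap function evaluation, which mirrors the shift from prefrontal deliberation to cerebellar automation.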

The role of the cerebellum as a controller and modifier of motor skills has been known to brain researchers for a long time, but lately new research has also shown that the cerebellum has a similar function for cognitive procedural skills. This is where the relevance to the mastering of subjects like mathematics really kicks in!

Solving mathematical problems using axioms and theorems is a cognitive skill which should be learned the same way as biking: by practice, practice, practice! From the biking analogy we have learned that practice improves long-term memory and automates the performance of the skill. It also improves the speed of computation. This really should be obvious; knowing the procedures by heart is far more efficient than looking them up on the internet or deducing them from axioms every time you need them!

Finally, I mentioned previously that long-term memory may enhance creative thinking. Here, too, the cerebellum appears to have a key role. A recent paper by Saggar and co-workers, published in Scientific Reports, shows an association between creativity and increased activity in this brain region. A probable mechanism is that the cerebellum, just as it can combine your automatic motor skills into new bodily movements, can quite unconsciously make new combinations from your automated cognitive "know-how".


This may also explain why many experience that creative thoughts seem to appear out of nowhere during activities with a low level of conscious demand, like driving, jogging or showering. It is the cerebellum speaking! But this means that creativity depends on having the building bricks available: facts, rules and cognitive procedures stored in long-term memory, automated through repetition.

A conclusion from this is that surface learning alone prevents long-term memory and creative thinking!



Saggar, M. et al. (2015). Pictionary-based fMRI paradigm to study the neural correlates of spontaneous improvisation and figural creativity. Scientific Reports, 5.

