Evaluating predictions, or more broadly the end-products resulting from methodologies used to anticipate possible futures, should become the norm rather than the exception, as explained in a previous post. Such an exercise should improve methods and processes and direct our efforts towards further research. Here we shall experiment with assessing a sample of open source predictions for the year 2012. This first part addresses the methodological problems encountered while creating the evaluation itself, and underlines the related lessons learned. The second part (forthcoming) will discuss results.
Actually, there is nothing new here, as evaluating the results of “predictions” is one of the fundamental principles of science (a scientific theory must have explanatory, descriptive and predictive power). If a theory does not fulfill the predictive criterion, then it must be disqualified. Things are relatively straightforward when dealing with hard science. They are much more complex when we are in the field of social science, where the very possibility of obtaining predictive power is hotly debated and often dismissed. If we consider the family of disciplines, sub-disciplines and methodologies that deal with future(s)-related analysis – what we call here strategic foresight and warning (foresight for short) – then we are faced with even more challenges. Some methodologies will be considered as scientific, and among them, some are close to hard science, while others belong to the realm of social science. Other approaches will be seen as art and are thus considered as not having to be tested. Furthermore, everyone has her/his own vision of what constitutes good future(s)-related analysis, what should be done and used, what is valuable and what is not.
Despite these difficulties, it is still worth our while to evaluate those future(s)-related efforts that had the courage to make a judgement on the future located on a timeline. This is, of course, a very small exercise and experiment compared with what is done by the Good Judgment Project, led by Philip Tetlock, Barbara Mellers, and Don Moore with funding from the Intelligence Advanced Research Projects Activity (IARPA) and explained in this article by Dan Gardner and Philip Tetlock, “Overcoming our aversion to acknowledging our ignorance.” Nevertheless, it will hopefully also bring interesting results, and the reflection it imposes and the questions it raises are in themselves a very constructive practice.
For this experiment, the sample consists of open source predictions for the year 2012 posted on the web from December 2011 to January 2012, as presented here.
The result of the evaluation, in a Google spreadsheet, can be downloaded here or viewed below. Explanations and discussion follow.
The sources used to evaluate the foresight are given in the seventh column (except when the answer is obvious or common knowledge, and thus does not necessitate reference to a specific source e.g. the European Union still exists).
The variety of formats and methodologies, which were moreover explained to varying degrees, was a first challenge. How can one consistently evaluate “predictions” delivered in ways as varied as classical analysis (e.g. The Financial Times – Beyond BRICS), scenarios (e.g. Tick by Tick Team), risks (e.g. CFR – Preventive Priorities Survey: 2012) or predictions mixed with policy recommendations and advocacy, which could be seen as a version of normative foresight (e.g. Foreign Policy with the International Crisis Group – 10 conflicts to watch in 2012)?
The Council on Foreign Relations, risk and making likelihood explicit
The Council on Foreign Relations’ approach is a perfect example of this hurdle. Its risk list for 2012 was particularly difficult to evaluate considering the way it is formulated and the lack of information regarding the methodology (those challenges have been removed, or at least reduced, in the 2013 version, where we find more detailed explanations and where likelihood and impact are separated). To find out exactly what the CFR meant, I had to turn to a companion article to the risk list published in The Atlantic, “Gauging Top Global Threats in 2012”. There we read:
“The contingencies that were introduced for the first time or elevated in terms of their relative importance and likelihood in 2012 included an intensification of the eurozone crisis, acute political instability in Saudi Arabia that threatens global oil supplies, and heightened unrest in Bahrain that spurs further military action.”
A contingency means “an event (as an emergency) that may but is not certain to occur”.
Thus we can deduce that the CFR saw all the events listed (their “risks”) as possible for the year 2012, if not probable. It is on this basis that the evaluation was made. Hence, all the CFR “risk-statements” were mentally expanded as follows: “a mass casualty attack on the U.S. homeland or on a treaty ally” means, for evaluation, “a mass casualty attack on the U.S. homeland or on a treaty ally” is possible in 2012 and would have a major impact on U.S. national interests (according to the tier to which the risk belongs). Likewise, “a major military incident with China involving U.S. or allied forces” means “a major military incident with China involving U.S. or allied forces” is possible in 2012 and would have a major impact, etc.
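The mental expansion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `expand_risk_statement`, its parameters and the exact wording of the template are hypothetical and come neither from the CFR nor from the evaluation spreadsheet.

```python
def expand_risk_statement(statement: str, year: int = 2012,
                          impact: str = "a major impact") -> str:
    """Make the implicit likelihood and impact of a risk statement explicit.

    The original statement is quoted unchanged; only the implied claim
    (possible in `year`, with `impact`) is appended. The template wording
    is an assumption made for illustration.
    """
    return (f'"{statement}" is possible in {year} and would have '
            f'{impact} on U.S. national interests')


# Two risk statements quoted from the CFR list discussed in the text.
risks = [
    "a mass casualty attack on the U.S. homeland or on a treaty ally",
    "a major military incident with China involving U.S. or allied forces",
]
for risk in risks:
    print(expand_risk_statement(risk))
```

The statement itself is never rewritten, which mirrors the choice made here to expand the text for evaluation without transforming it.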
Making the “risk-statements” more explicit for evaluation (without, however, transforming the statement itself) immediately underlines how the fusion of likelihood and impact that most commonly exists in the idea of risk (until the concept itself was revised by the new ISO 31000:2009 standard) creates supplementary difficulties in terms of evaluation, hence my personal reluctance to use the concept, despite its fashionable character. What are we to judge: a likelihood? An estimation of impact? A timing? As already mentioned, the CFR Preventive Priorities Survey has indeed tackled this problem and now (2013) gives detailed results in terms of impact and likelihood.
This underlines how crucial it would be, ideally, always to include an estimation of likelihood for all results of future(s)-related analysis, as done, for example, in the Intelligence Assessments (see p.14 of the ICA on Global Water Security).
In our sample, each prediction, or series thereof, corresponds to one or another methodology. Yet, rather than trying to standardize thoughts, for example by transforming what the authors wanted to write into a sentence that is easy to evaluate, I chose to keep the text as it was and to evaluate it as such. Most of the time I broke it down into various paragraphs, and sometimes I expanded it mentally, as explained above for the CFR and in agreement with their methodology, but I did not alter it. The exercise was constructive in itself and led to interesting points. We shall see with the next post whether the results also say something about the foresight methodology itself.
When the text was too far removed from something that looked like a judgement on the future, for example when it was only an opinion on what was happening, or when it was a 50/50 possibility, I excluded the sentence or paragraph from the sample (in red in the spreadsheet).
Scenarios, timing and content
As I started the concrete phase of evaluating statements with the fictionalized scenario made by Tick by Tick Team (Finance), it very quickly became clear that I had to make two types of assessment: one regarding the plausibility and logic of the content of the prediction itself, the other regarding the accuracy of the timing. Indeed, some of the predictions still sounded plausible: they had not happened in 2012 but could not be ruled out for the short to medium term, e.g. “Greece leaves the Euro, returns to the Drachma.” (3002 – this number corresponds to the identification given in the database, to facilitate reference). To me there is a large difference between this and a prediction that is plainly wrong in terms of content and thus impossible in terms of timing: e.g. Syria deals with the “initial post-Assad stages” (2011) or “Obama decides not to run for elections” (3011).
Furthermore, this approach will allow me to test a hunch according to which we are in general much better at explaining phenomena than at timing them, if only because we hardly ever work on timing (outside the hard science realm).
Evaluating content and timing: a difficult, uncertain, never-ending task?
Thus, columns 4 and 5 display marks for content and timing, ranging from 0 (completely wrong) to 1 (completely accurate). There are, however, major hurdles with this approach. First, by judging the content in terms of plausibility of dynamics, I evaluate one understanding (the author’s) against another (mine). There is little we can do about it as this is the core of research and debates in social science, besides giving evidence (column 7, the sources), developing a coherent argument and/or pointing out flaws in the argument subjected to the evaluation. A commissioned report would need to be more detailed and specific than I could be in the framework of a volunteered experiment.
Second, it implies that by evaluating the plausibility of something happening in the future, I am myself making a judgement on the future, thus a prediction. Ultimately, those challenges should be resolved through the happenstance of events and facts, which suggests that evaluations should themselves be reviewed and followed over time. This is certainly not ideal, but still better than losing the information on timing and content, which would happen if one opted for black-and-white, true-or-false, 0-or-1 answers.
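To make the last point concrete, here is a minimal Python sketch of what keeping two separate marks preserves compared with a single true/false answer. Everything in it is an assumption made for illustration: the `Evaluation` class, its field names and the `binary_verdict` helper are hypothetical, and the marks given to the two example statements are invented, not taken from the spreadsheet.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """One evaluated statement, with separate marks for content and timing."""
    ref: int          # identification number used in the database
    statement: str
    content: float    # 0 = completely wrong, 1 = completely accurate
    timing: float     # same scale, applied to the timing

    def __post_init__(self) -> None:
        # Reject marks outside the 0-to-1 scale used for both columns.
        for name in ("content", "timing"):
            mark = getattr(self, name)
            if not 0.0 <= mark <= 1.0:
                raise ValueError(f"{name} mark must be between 0 and 1, got {mark}")

    def binary_verdict(self) -> bool:
        """Collapse both marks into a single true/false answer."""
        return self.content == 1.0 and self.timing == 1.0


# Illustrative marks only: they are NOT the values recorded in the spreadsheet.
greece = Evaluation(3002, "Greece leaves the Euro, returns to the Drachma.",
                    content=0.7, timing=0.0)
obama = Evaluation(3011, "Obama decides not to run for elections",
                   content=0.0, timing=0.0)

for e in (greece, obama):
    print(e.ref, e.content, e.timing, e.binary_verdict())
```

With these invented marks, both statements receive the same binary verdict, while the separate content mark still distinguishes a plausible but mistimed prediction from one that is wrong in substance.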
Objectivity (as much as biases allow) on the part of the person assessing the predictions is crucial, and the use of teams that would discuss and compare their analyses would be best. Furthermore, teams would also help to overcome plain lack of knowledge on one issue or another.
This leads us to a last challenge that is not easily overcome for some predictions: the information that is available to the person doing the evaluation. Still using the CFR example, and more particularly the risk of “a mass casualty attack on the U.S. homeland or on a treaty ally”, some actions taken throughout the year by authorities may have prevented a risk from materializing, and thus the prediction could be seen as false. Certainly, had no intelligence, defense and diplomatic actions existed, such a risk would have materialized. Such state actions are ongoing, and, as outsiders, we can only estimate (without complete certainty) that it is because of them that the threat did not materialize, not because the risk was incorrectly identified. An evaluation made by an insider with access to all classified documents would be made with more confidence. Here, I could only estimate the reality of the risk to the best of my understanding and knowledge, for example with the use of counterfactuals.
Should all those challenges, and the existence of uncertainty even in evaluation, lead us to conclude that trying to evaluate foresight products is useless? My first answer, at this stage, is no, because all the questions that the evaluation forces one to ask are crucial and can only lead to better methodologies and thus to better judgements on the future. Evaluation is thus a pledge of quality. We shall see next what the results of the assessment, keeping in mind all their imperfections, may tell us.
“Gauging Top Global Threats in 2012”, Interviewee: Micah Zenko, Fellow for Conflict Prevention; Interviewer: Robert McMahon, Editor; December 8, 2011, The Atlantic.