Posted January 19 2022. Filed under Data, Science.

1. Truth in Data

In a blog posting from April 2020, I discussed the development of a visualisation of data related to the global COVID-19 pandemic. In the comments below the blog posting, I mentioned that there are several issues with the data in terms of accuracy and what the ‘truth’ of the data actually is. I will expand on this here.

Defining precisely what the ‘truth’ is in data and statistics is not easy. Most statistical data in the real world is generated from statistical samples and then a process of inductive inference, based on statistical theory, is used to generalise observations and conclusions to a broader ‘population’. Many errors and biases can be introduced in this process, and many assumptions must be made. A common error is that the sample is not representative of the population (Baggini, 2017; Spiegelhalter, 2019).

Another source of error is that all conclusions about data collected in the real world ultimately rely on how well the conceptual ‘model’ used for the data collection and analysis approximates the real world. Sometimes too much faith is placed in the model’s claims to objective ‘truth’, leading to sensationalist claims in newspapers (Cairo, 2016).

Epidemiology, and data concerning the global COVID-19 pandemic, are no different. Any claim to objective conclusions about the impact of the pandemic on global society based on this data must include details of how the data is collected, analysed and presented, and of the assumptions made in this process.

The global pandemic of the last two years has seen newspapers, TV and web news reports and social media full of data in perhaps an unprecedented fashion; people have never been more exposed to statistics, graphs and data visualisations concerning the number of COVID-19 cases and deaths. Some of this has demonstrated good examples of data collection and analysis (European Centre for Disease Prevention and Control, 2021), and presentation in the form of data visualisations (BBC, 2021a; The Guardian, 2020a), but there have been many bad examples of misleading communication concerning the ‘truth’ of the pandemic (The Conversation, 2020).

This flood of data (or ‘infodemic’) has led to much discussion about how well the data reflects reality: how many people around the globe have the virus, and how many people have died because of it? Issues of ‘truth’ in health data have never been more hotly debated and analysed. There are many threats to truth, including a general mistrust in science, experts and politicians, but also a reliance on untrustworthy social media sources and groups promoting ‘disinformation’ and anti-vaccination beliefs (The Lancet, 2020).

A common bias in the data many countries collect about COVID-19 infections is that only people with symptoms are tested, so asymptomatic cases go unrecorded. This systematic bias leads to an underestimation of infection rates, which can be quite significant (Spiegelhalter and Masters, 2021). Counting deaths is no simpler: “The apparently simple task of counting COVID-19 deaths is far from easy, with no ‘true’ answer” (Spiegelhalter and Masters, 2021: 108).
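The effect of testing only symptomatic people can be illustrated with some simple arithmetic. The sketch below is a toy calculation, not an epidemiological model: the function name and all of the numbers (a 5% true infection rate, 60% of infections showing symptoms) are hypothetical values chosen purely for illustration.

```python
def recorded_infection_rate(true_rate: float, symptomatic_share: float) -> float:
    """Infection rate that appears in the data if only symptomatic people are tested.

    Simplifying assumption: every symptomatic infection is tested and recorded,
    and no asymptomatic infection is.
    """
    return true_rate * symptomatic_share


# Hypothetical figures, for illustration only.
true_rate = 0.05          # 5% of the population is actually infected
symptomatic_share = 0.60  # 60% of infections produce symptoms

recorded = recorded_infection_rate(true_rate, symptomatic_share)
missed_share = 1 - recorded / true_rate

print(f"recorded rate: {recorded:.3f}")
print(f"share of infections missing from the data: {missed_share:.0%}")
```

Under these assumptions the recorded rate is 3% rather than 5%, so two in five infections never appear in the data. The bias is systematic: collecting more data in the same way does not remove it.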

In many cases, including among western governments, collection of COVID-19 data has fallen short of statistical ideals. For instance, in the UK at the start of the pandemic, COVID-19 ‘deaths’ were presented in a misleading way: any death following a recorded infection was counted, with no limit on the time that had passed since that infection. This was changed to a 28-day limit in August 2020 (Spiegelhalter and Masters, 2021).
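How much a counting rule matters can be seen by applying two rules to the same records. The sketch below is a minimal illustration under stated assumptions: the function name and the three records are invented, and treating day 28 itself as inside the limit is my assumption about the boundary, not something taken from the official definition.

```python
from datetime import date
from typing import Optional


def counts_as_covid_death(positive_test: date, death: date,
                          limit_days: Optional[int]) -> bool:
    """Return True if a death is counted under a 'death after positive test' rule.

    limit_days=None reproduces the rule with no time limit.
    Assumption: the limit is inclusive (a death on day 28 is counted).
    """
    gap = (death - positive_test).days
    if gap < 0:
        return False  # death recorded before the test: not counted
    return limit_days is None or gap <= limit_days


# Invented records: (date of positive test, date of death).
records = [
    (date(2020, 4, 1), date(2020, 4, 10)),  # 9 days after the test
    (date(2020, 4, 1), date(2020, 4, 29)),  # 28 days after the test
    (date(2020, 4, 1), date(2020, 7, 15)),  # 105 days after the test
]

no_limit = sum(counts_as_covid_death(t, d, None) for t, d in records)
with_limit = sum(counts_as_covid_death(t, d, 28) for t, d in records)

print(f"no time limit: {no_limit} deaths, 28-day limit: {with_limit} deaths")
```

The same three records give three deaths under one rule and two under the other. Neither number is ‘the truth’; each answers a slightly different question.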

2. Comparing Countries

One problem with the presentation of this data globally has been the difficulties arising from a lack of any standardised way of analysing the data and presenting case and death numbers. Different countries (and even health and statistics agencies within the same country) use different methods and definitions: for example, even within western Europe there are important differences, with some countries counting deaths in care homes and hospitals and others not, and some countries only counting deaths where the virus is mentioned on death certificates, and others only where there has previously been a positive test for the individual. Others use only numbers of excess deaths (‘excess mortality’).
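Excess mortality avoids the question of how a COVID-19 death is defined by comparing all deaths observed in a period with the number that would normally be expected. A minimal sketch follows, assuming the expected number is simply the average of the same period in previous years; real agencies use more elaborate baselines, and every figure here is invented.

```python
from statistics import mean
from typing import Sequence


def excess_deaths(observed: float, baseline_years: Sequence[float]) -> float:
    """Observed deaths minus the expected number.

    Assumption: 'expected' is the plain average of earlier years, with no
    adjustment for trends in population size or age structure.
    """
    return observed - mean(baseline_years)


# Invented all-cause death counts for the same period in five earlier years.
baseline = [10_000, 10_200, 9_800, 10_100, 9_900]
observed_2020 = 12_500

excess = excess_deaths(observed_2020, baseline)
excess_pct = excess / mean(baseline)

print(f"excess deaths: {excess:.0f} ({excess_pct:.0%} above baseline)")
```

Because the result depends on the choice of baseline years, two agencies can report different excess mortality for the same country without either having miscounted.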

Another issue with comparing death rates between countries is that age distributions differ considerably: COVID-19 causes more deaths in older people, so it disproportionately affects countries with a relatively elderly population. Comparing countries at a national level is also problematic because deaths in 2020 were highly localised within particular regions and cities in some countries (such as Spain and Italy), while other countries (such as the UK) were affected across the whole national area. Data collected at country level hides this geographic distribution. Data anonymisation and aggregation practices also differ between countries.

Other countries such as Tanzania, North Korea, China, Iran and Turkmenistan are governed by unstable, secretive or undemocratic governments, and the figures they have reported for cases and deaths are probably wildly inaccurate. Even a country such as the USA did not start properly collecting and reporting data at a federal level until March 2021 (BBC, 2020; The Guardian, 2021; Spiegelhalter and Masters, 2021).
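The age-distribution problem is usually handled by age-standardising death rates, that is, re-weighting each country's age-specific rates to a common reference population. The sketch below illustrates direct standardisation with invented numbers: the age bands, rates, population shares and country labels are all hypothetical, and both countries are deliberately given identical age-specific death rates.

```python
AGE_BANDS = ("0-39", "40-69", "70+")

# Hypothetical deaths per 100,000 people in each age band (same in both countries).
age_specific_rate = {"0-39": 1.0, "40-69": 50.0, "70+": 1000.0}

# Hypothetical share of each country's population in each age band.
population_share = {
    "Country A (younger)": {"0-39": 0.60, "40-69": 0.32, "70+": 0.08},
    "Country B (older)": {"0-39": 0.40, "40-69": 0.40, "70+": 0.20},
}

# Hypothetical common reference population used for standardisation.
standard_share = {"0-39": 0.50, "40-69": 0.36, "70+": 0.14}


def weighted_rate(rates: dict, shares: dict) -> float:
    """Overall deaths per 100,000: age-specific rates weighted by population shares."""
    return sum(rates[band] * shares[band] for band in AGE_BANDS)


crude = {name: weighted_rate(age_specific_rate, shares)
         for name, shares in population_share.items()}
standardised = weighted_rate(age_specific_rate, standard_share)

for name, rate in crude.items():
    print(f"{name}: crude rate {rate:.1f} per 100,000")
print(f"age-standardised rate (both countries): {standardised:.1f} per 100,000")
```

The crude rate of the older country comes out at more than twice that of the younger one, even though a person of any given age faces exactly the same risk in both. Once the rates are weighted to the same reference population, the difference disappears.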

3. Outlook and the Future

In April 2020, Professor David Spiegelhalter wrote an article for the Guardian website that outlined the difficulties inherent in comparing COVID-19 cases and deaths between countries. This was then interpreted by the UK Prime Minister Boris Johnson, in a statement in the UK parliament in May 2020, as meaning that comparing countries was, in the words of the chief medical officer Professor Chris Whitty, a “fruitless exercise”. This highly public example shows that the ‘truth’ of data can be distorted with poor presentation and communication, and Professor Spiegelhalter tried to clarify things in subsequent public writing to emphasise that “we should now use other countries to try and learn why our numbers are high” (The Guardian, 2020b).

Even with all of these difficulties, it is important to study the differences between populations and countries. Some useful science, and some ‘truths’, can be drawn from the data: for example, that one strategy used by a group of countries to control the virus (such as a national lockdown or social distancing) is associated with a broadly different mortality rate than another strategy used by other countries.

Better communication of data to the public, and more effective responses to the pandemic, can be fostered by international collaboration, diversity of approaches, and better data collection and education (Pearce et al., 2020).

References

Baggini, J. (2017) A Short History of Truth: Consolations for a Post-Truth World. Quercus.

BBC (2020) Coronavirus: Why are international comparisons difficult? [Online] [accessed 15th November 2021] https://www.bbc.co.uk/news/52311014

BBC (2021a) Covid map: Coronavirus cases, deaths, vaccinations by country [Online] [accessed 15th November 2021] https://www.bbc.co.uk/news/world-51235105

BBC (2021b) Covid: The UK is Europe’s virus hotspot – does it matter? [Online] [accessed 25th November 2021] https://www.bbc.co.uk/news/health-58849024

Cairo, A. (2016) The Truthful Art: Data, Charts and Maps for Communication. New Riders.

The Conversation (2020) Next slide please: data visualisation expert on what’s wrong with the UK government’s coronavirus charts [Online] [accessed 25th November 2021] https://theconversation.com/next-slide-please-data-visualisation-expert-on-whats-wrong-with-the-uk-governments-coronavirus-charts-149329

European Centre for Disease Prevention and Control (ECDC) (2021) How ECDC collects and processes COVID-19 data [Online] [accessed 15th November 2021] https://www.ecdc.europa.eu/en/covid-19/data-collection

The Guardian (2020a) How coronavirus spread across the globe – visualised [Online] [accessed 15th November 2021] https://www.theguardian.com/world/ng-interactive/2020/apr/09/how-coronavirus-spread-across-the-globe-visualised

The Guardian (2020b) Author of Guardian article on death tolls asks UK government to stop using it [Online] [accessed 15th November 2021] https://www.theguardian.com/politics/2020/may/06/author-of-guardian-article-on-death-tolls-asks-government-to-stop-using-it

The Guardian (2021) Which countries have fared worst in the pandemic? [Online] [accessed 15th November 2021] https://www.theguardian.com/theobserver/commentisfree/2021/apr/18/which-countries-have-fared-worst-in-the-pandemic

The Lancet (2020) The truth is out there, somewhere. Lancet, 396(10247): 291.

Pearce, N., Lawlor, D. A. and Brickley, E. B. (2020) Comparisons between countries are essential for the control of COVID-19. International Journal of Epidemiology, 49(4): 1059-1062.

Spiegelhalter, D. (2019) The Art of Statistics: Learning from Data. Pelican Books.

Spiegelhalter, D. and Masters, A. (2021) Covid by Numbers: Making Sense of the Pandemic with Data. Pelican Books.
