Daily Case Counts Are Not Meaningful

In Part 2 of my posts on COVID-19 in Ontario, I said that for Part 3 we’d be taking a look at the updated case statuses, as well as hospitalizations. However, I’d like to put that on hold for a moment to instead address something which I think needs to be addressed far more, and which worryingly continues to occur.

I will also preface this post with the same caveats and disclaimer as for my other analyses on this topic related to health and disease:

  • I am not an epidemiologist, nor am I a subject matter expert on disease nor public health policy
  • All handling of the data / code / statistics, interpretations thereof, and thoughts expressed are my own and only my own
  • This post may contain errors or omissions given the above which are only my own

I intentionally chose a rather inflammatory title for this post because I wanted to make this point strongly and because I feel it needs to be made, and strongly. As opposed to my usual writing style, I will not have an introduction and background, but instead state the bottom line up front:

Looking at daily case counts for COVID-19 alone is, at best, uninformed and naïve, and at worst, highly misleading. 

In fact, I will illustrate that:

  • Apparent exponential growth in positive cases could be explained by the growth in testing a population with a set amount of disease present
  • What might appear to be large daily changes in the absolute number of cases can be duplicated as nothing more than statistical noise due to sampling

DAILY NEW CASES AS A RANDOM SAMPLE

Consider a population, of size n, in which there exists a disease as a binary state for each individual, 0 or 1.

For the purposes of this exercise, we will not consider the time aspect of disease (i.e. individuals recovering), nor any mechanisms of transmission, onset delay, deaths, population change, treatment, etc. We only make the simple assumption that a set number of individuals in the population have the disease – let’s call that number d. Then the proportion of the population which has the disease is:

    \[ p = \frac{d}{n} = \mbox{\% of population with disease} \]

Epidemiologically speaking, this is just the prevalence, and in this case we are making a very simple assumption that this is just a fixed number which does not change (i.e. the period prevalence and point prevalence are identical and constant).

Now let us consider how the disease is identified via testing. Testing is administered on samples of the target population in order to detect the disease. For the purposes of this example, we make very simple assumptions about testing: tests are 100% accurate (i.e. there are no false positives or false negatives), need only be administered once, have no effect on the disease, and result in a binary outcome – positive (1) or negative (0). We also assume there is no delay between administering the test and receiving the result (i.e. test results are instantaneous).

What is more complicated about the testing, however, is that the number of daily tests administered is not constant but varies. Each day, depending upon the “testing capacity” a random sample of the population is drawn without replacement, and only those individuals are tested.
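
To make the link between test volume and case counts explicit, the setup above can be written down as a short sketch. Here k_t (my notation for illustration, not used elsewhere in this post) is the number of tests on day t and X_t is the number of positives found that day. Treating a single day's draw on its own, and ignoring the small change in the population as tested individuals are removed, X_t follows a hypergeometric distribution:

    \[ X_t \sim \mathrm{Hypergeometric}(n, d, k_t), \qquad \mathrm{E}[X_t] = k_t \frac{d}{n} = k_t \, p, \qquad \mathrm{Var}[X_t] = k_t \, p \, (1 - p) \, \frac{n - k_t}{n - 1} \]

So with p held fixed, the expected number of daily positives is simply proportional to the number of tests performed that day, and the spread around that expectation comes from sampling alone.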

We can simulate the whole process in Python using random number generation. For our hypothetical population, we will take the base prevalence in the population to be 4 percent:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Instantiate a target population with a 4% baseline infection rate
n = 2000000

x = pd.Series(np.random.choice([0, 1], size=(n,), p=[0.96, 0.04]))

x.value_counts()/len(x)
0    0.959859
1    0.040141
dtype: float64

Next, we will create a mathematical function to represent the number of daily tests performed. Here we will use a logistic growth function of the form:

    \[ f(t) = \dfrac{10000}{1 + e^{-\sqrt{t}+6}} \]

I had to play with the parameters somewhat to get a nice-looking curve, but the end goal was a number of daily tests that starts small (a couple of dozen on day zero), rises gradually, and levels off towards 10,000 by the 100th day of testing as capacity increases over time. The plot below shows this is the case:

# Create a logistic function for increasing number of daily tests
sample = np.arange(0, 100, 1)
z = (1/(1 + np.exp(-sample**0.5+6))*10000).astype(int)

plt.figure(figsize=(8,6))
plt.plot(z, marker='.')
plt.xlabel('Day Number')
plt.ylabel('Daily Tests')
plt.show()

That looks pretty good to me! Just a quick check – what will be the total number of tests performed in our simulation?

z.sum()
606195

Great, so just over 600K tests in total. Now for the real work – for each day, we take a random sample from the population as specified by our daily number of tests (the function above) and record total number of outcomes, both positive and negative. We’ll save these for use later in a DataFrame:

# For each day, take a random sample and count
positives = list()
negatives = list()

for i in np.arange(100):
    
    # Get the daily number of tests
    n_tests = z[i]
    
    # Take the sample and calculate positives and negatives
    sample = x.sample(n_tests)
    positive = sample.sum()
    negative = len(sample) - positive
    
    # Remove the tested individuals from the population
    x.drop(sample.index, inplace=True)

    # Append to the daily positive and negatives counts
    positives.append(positive)
    negatives.append(negative)
    
    # Output
    print(i, z[i], len(sample), len(x))
    
# Make a dataframe
cases_df = pd.DataFrame({'positives':positives, 'negatives':negatives})
0 24 24 1999976
1 66 66 1999910
2 100 100 1999810
3 138 138 1999672
4 179 179 1999493
5 226 226 1999267
...
...
...
95 9769 9769 1432988
96 9780 9780 1423208
97 9791 9791 1413417
98 9801 9801 1403616
99 9811 9811 1393805

That’s a lot of output, but you can see we’ve generated positive and negative case counts for each day, and that of our 2M population, roughly 1.39M individuals remain untested after the ~600K tests we performed.

Now let’s take a look at the daily case count!

plt.figure()
plt.plot(cases_df['positives'])
plt.xlabel('Day')
plt.ylabel('Count of Positive Cases')
plt.show()

Well, that’s pretty shocking isn’t it! Look at that steep rise in cases… we can see also there is some noise in the curve due to the fact we are randomly sampling from the population each day.

This also duplicates near exponential growth with the curve flattening, as we can see if we look at the aggregate count over time, and apply a logarithmic scale on the y-axis. This figure should look pretty similar to a lot you’ve seen recently:

plt.figure()
plt.plot(cases_df['positives'].cumsum(), marker='.')
plt.xlabel('Day')
plt.yscale('log')
plt.ylabel('Cumulative Count of Positive Cases')
plt.show()

Now let’s look at the day-over-day percentage change in daily cases, as this was the original metric we were interested in from the beginning:

plt.figure()
plt.plot(cases_df['positives'].pct_change()*100.0)
plt.xlabel('Day')
plt.ylabel('Change in Daily Positives (%)')
plt.show()

Wow! You can see there is a lot of variability in the day-over-day percentage change in the absolute number of cases (positives). When the sample size is very low, closer to day zero, the percentage change swings widely because the total number of tests is small, but as the number of daily tests increases, we can see it level out to fluctuations around zero.

This is what we would expect, as we are just randomly sampling. However, there is still a swing of positive and negative changes in the day-over-day cases due to that random sampling, even as the number of tests performed daily becomes more consistent (after day 70 or so).

Because we are randomly sampling each day, we just may happen not to pick up as many of the 4% of people with the disease, even if the number of tests performed is the same.

Even if we look only after day 70, when the number of tests starts to flatten, we can see that the daily percentage change still swings within roughly ±10%, which is largely just due to the different random samples being drawn (who we chose to test each day) and how many positives they happened to include.
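
As a rough cross-check on the size of those swings, here is a minimal sketch that is separate from the simulation above. It assumes a fixed 9,800 tests per day (roughly the plateau of the testing curve), a constant 4% prevalence, and approximates each day's sample as a binomial draw, which seems reasonable because a daily sample is tiny relative to the population. The names `daily_tests`, `prevalence` and `simulate_pct_change` are mine, for illustration only.

```python
import numpy as np

def simulate_pct_change(daily_tests, prevalence, n_days, seed=0):
    """Day-over-day % change in positives with constant testing and prevalence.

    Assumption: each day's positives are an independent binomial draw.
    """
    rng = np.random.default_rng(seed)
    positives = rng.binomial(daily_tests, prevalence, size=n_days)
    return np.diff(positives) / positives[:-1] * 100.0

daily_tests = 9800   # assumed plateau of the daily testing curve
prevalence = 0.04    # same base rate as the simulated population

# Theoretical mean and standard deviation of the daily positive count
mean_pos = daily_tests * prevalence
sd_pos = np.sqrt(daily_tests * prevalence * (1 - prevalence))

pct_change = simulate_pct_change(daily_tests, prevalence, n_days=1000)

print(f"Expected daily positives: {mean_pos:.0f} +/- {sd_pos:.1f}")
print(f"Std. dev. of day-over-day change: {pct_change.std():.1f}%")
```

Under these assumptions the daily count is about 392 positives with a standard deviation of about 19, i.e. roughly 5% of the count. Because a day-over-day change compares two such noisy counts, swings of several percent in either direction are expected even though nothing about the disease has changed.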

DAILY POSITIVES IN PROPORTION

The right answer, of course, is to look at the data in context – instead of looking at the absolute number of cases, since this is highly influenced by the number of tests performed, we should instead be looking at the proportion of tests which come back positive (which, under our assumptions of random sampling and perfect tests, estimates the prevalence), ideally in aggregate.

We can do this for our simulation, looking at the daily positive proportion:

# Calculate the percentage positive tests
pct_positive = cases_df['positives']/(cases_df['positives']+cases_df['negatives'])

# Plot
plt.figure()
plt.plot(pct_positive*100.0)
plt.xlabel('Day')
plt.ylabel('Percentage positive (%)')
plt.show()

As you can see, there is again a lot of noise near the beginning when the number of individuals tested is small, but it soon evens out and is around our 4% base rate which we know to be the true value in the population. This figure looks very similar to that for the percentage change in daily positives we saw before.

We can also look at the cumulative running proportion of positive tests, and see how this changes as we test more and more each day:

# Calculate the running percentage positive
pct_positive_running = cases_df['positives'].cumsum()/(cases_df['positives'].cumsum()+cases_df['negatives'].cumsum())

# Plot
plt.figure()
plt.plot(pct_positive_running*100.0)
plt.xlabel('Day')
plt.ylabel('Running percentage positive (%)')
plt.show()

You can see there is noise at the beginning again, but the proportion of positive tests quickly approaches the true prevalence, the more tests that are performed.

Additionally, since the sample size gets larger with each day as more tests are performed, the amount of noise around the true value for the prevalence (4%) is reduced compared with what we saw above, as our percentage positive approaches that true value the more tests we do.
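
How quickly that noise shrinks can be sketched with the usual standard error of a sample proportion. This is an approximation under stated assumptions: it treats the tests as independent binomial draws and ignores the finite-population correction that applies when sampling without replacement, so it slightly overstates the noise in our simulation. The function name `standard_error` is hypothetical.

```python
import numpy as np

def standard_error(p, n_tests):
    """Approximate standard error of a sample proportion (binomial approximation)."""
    return np.sqrt(p * (1 - p) / n_tests)

p = 0.04  # true prevalence in the simulated population

# Compare a small early sample, one large day of testing, and the full run
for n_tests in (100, 10_000, 600_000):
    se = standard_error(p, n_tests)
    print(f"{n_tests:>7} tests: 4% +/- {se * 100:.2f} percentage points")
```

With 100 tests the estimate is only good to about 2 percentage points either way, with 10,000 tests to about 0.2, and with 600,000 cumulative tests to about 0.03, which is why the running proportion settles down so much faster than the daily one.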

LOOKING AT REAL DATA

Ok then, so where did this crazy logistic growth function I cooked up come from, and why did I pick a baseline prevalence of 4% for my hypothetical population? And how does this all tie back to COVID in Ontario?

Well, a picture is worth a thousand words, so 4 pictures must be worth 4,000, so let’s look at the real data for COVID testing in Ontario. Here is the number of new daily cases for Ontario:

And here is the number of daily tests:

And here is the daily percentage positive:

And finally, here is the aggregate percentage positive:

(Note: in the early days of testing the number actually decreases, as 1 or 2 probable positive cases were backed out)

You can see the number of daily positives rose rapidly along with the volume of tests as it increased. Does that orange curve look at least somewhat similar to something we saw before…?

In the latter figures, you can see that Ontario’s percentage positive also starkly rose as testing increased, the daily percentage positive reaching nearly 12% at one point, and then as more and more testing was done this normalized back down to around 3% most recently.

Oh, and what’s the final aggregate percentage positive for all tests done in Ontario so far, just over 560,000?

Why, 4.1%, of course.

CONTEXT MATTERS

As one of my relatives said when I was speaking with them about everything that’s been going on with COVID: “Facts scare me; Context is important.”

The point I’m trying to make here, which many others have also made but which seems too often to be forgotten or ignored, is that the number of cases needs to be looked at in the context of testing.

The more you test, the more cases you are going to find.

Many other people have made this point. Nate Silver has weighed in on the issue fairly heavily. Trump even pointed it out in his not-so-intelligent way, and was lampooned for it.

The message here is that the more you look for something, the more you’re going to find (up to a point, which we’ll get to in a second) and so pretending that a daily “rise in cases” is representative of the actual prevalence of the disease when measured in absolute numbers, given that the amount of daily testing varies, is misguided – if not completely false.

I have not even yet addressed in this post, as I did previously, that the tests performed for COVID are on a highly biased sample.

It is rather concerning to me personally that we live in a day and age in which there is not enough general numeracy to recognize this and that, more worryingly, policy decisions impacting millions of Canadians are sometimes being made as a result (!). Montreal delayed reopening its schools due to a rise in cases; however, this rise is, of course, stated as an absolute number, and information on testing for Montreal is nowhere to be found to put this into context.

At the very least, the Ontario Ministry of Health makes the daily data on COVID-19 in Ontario available (which I have been archiving in my github repository and used for this post and the previous ones). However, it still reports the daily absolute percentage change front and center – percentage positive is buried in the daily epidemiological summary, a PDF report which I doubt most bother to find their way to – and all I’ve ever seen reported in the news is absolute numbers and case counts.

Still, there is hope, and I think things are getting slightly better. I unfortunately don’t have the resources to cite, but from what I’ve heard in the federal briefings, Dr. Tam has been referencing metrics like percentage positive, and there are other examples as well (e.g. I believe this is an opening criterion for some U.S. states).

Unfortunately, this is not part of Doug Ford’s plan for re-opening Ontario as one of the criteria is “a consistent two–to-four week decrease in the number of new daily COVID‑19 cases”. 

That number will, of course, depend on a number of factors, one of which is the amount of testing being done. Another is the nature of probability and statistics.

 
