- I am not an epidemiologist, nor am I a subject matter expert on disease or public health policy
- All handling of the data, interpretations thereof, and thoughts expressed are my own and only my own
- Interpretation and handling of the data may contain errors or omissions given the above
Everyone’s talking about coronavirus. It’s hard not to think about, hard not to read about, on a daily basis. It’s almost impossible not to – it’s top of mind for everyone, the one thought permeating our collective consciousness at this moment in time.
One result of everything that’s happening right now is that there’s a lot of data flying around out there, and a lot of articles being written, and a lot of analysis and visualization work being done, as everyone is trying to make sense of this whole business as it unfolds in real-time.
So I thought it was time to finally break down and weigh in myself. And then I thought twice, and thought it would be better not just to do that, but instead to try to make a contribution in some way.
I’m not a whole department, and I don’t think I can build anything like Johns Hopkins CSSE has. Nor am I the CDC. They have produced some good visualizations around COVID as well. So I thought I’d tackle something simpler and closer to home. Something I keep reading about, which is how Toronto and Ontario are handling the situation.
The Ontario Ministry of Health and Long-Term Care (or MOHLTC) maintains a web page with updates on the status of COVID-19 testing in Ontario every morning at 10:30 AM EST and evening at 5:30 PM EST here: https://www.ontario.ca/page/2019-novel-coronavirus
This page has two important pieces of information that are updated every day:
- Status of Cases in Ontario – number of positive, negative, open, deceased and resolved cases, and total tested
- New confirmed positive cases – demographic and status information on new confirmed positive cases as they occur (please see my note regarding this below)
The monumental effort that is no doubt required to consolidate and condense all the information necessary and keep it up to date and accurate is to be applauded and celebrated. During this time nothing can be more important than keeping the public informed.
Unfortunately, for myself and many other people, this information isn’t particularly useful for a couple of reasons:
- The previous information disappears with each update, and there doesn’t seem to be any kind of historical record publicly available.
- The information is not presented or available in a format that lends itself to analysis. Good for reading, and understanding the raw numbers as a current status, but not so good for analysis – what can be done to make that a little better?
So I’ve started to put something together, and without burying the lede any further, if you’re interested you can check it out on my github here: https://github.com/mylesmharrison/covid-ontario
Right now I am saving the COVID case status data and confirmed cases on a daily basis in Excel and CSV format, and am also working to automate this process, along with the data processing and visualization.
In my searches, I haven’t been able to locate any other sources doing this, with the exception of the MIDAS network, who are also saving the daily case status information. I am incorporating their historical data into my repo as well, and it is primarily their data that is used in the section below.
Note: If you have access to a historical archive of the patient details in ‘New confirmed positive cases’, please let me know. I have been unable to locate an archive of this information online. (UPDATE: After I began saving it daily, it appears this information is on hiatus as of 2020/03/27)
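To give a rough idea of what the daily saving step could look like, here is a minimal sketch using only the Python standard library. The function name `append_snapshot`, the file layout and the column names are all assumptions made for illustration, not the code in my repo, and it assumes the set of columns stays the same from one day to the next.

```python
import csv
from datetime import date
from pathlib import Path


def append_snapshot(csv_path, snapshot, as_of):
    """Append one dated row to a CSV history file.

    Hypothetical helper: `snapshot` is a dict of metric name -> value
    for a single update. Returns False (and writes nothing) if a row
    for `as_of` has already been saved, True otherwise.
    """
    path = Path(csv_path)
    is_new = not path.exists()

    # Skip the write if this date is already in the history file.
    if not is_new:
        with path.open(newline="") as f:
            if any(row["date"] == as_of.isoformat() for row in csv.DictReader(f)):
                return False

    # Append the row, writing the header only when creating the file.
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["date", *snapshot])
        if is_new:
            writer.writeheader()
        writer.writerow({"date": as_of.isoformat(), **snapshot})
    return True
```

Calling this once per update builds up exactly the kind of historical record that the ministry page does not keep. If the labels on the page change, the header written on the first day will no longer match, so a real version would need to handle that case.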
Another disclaimer is that the code is currently fairly rough and should be considered work-in-progress (like I said, I am just one person).
Like any good piece of analysis, I think it would be best if I stated my methodology and assumptions up front. These mostly have to do with the facts that:
- I am not an epidemiologist nor subject matter expert as I already noted
- Regrettably, the way the data were labelled on the MOHLTC website changed over the weeks (or, as we data folks would say, the schema changed), so I have collapsed the following columns based on the presence of NAs which I zero-filled in order to do so:
- Total number of patients approved for 2019-nCoV testing to date + Total number of patients approved for COVID-19 testing to date => total_approved_for_testing
- Presumptive positive + Confirmed positive => presumed_or_confirmed_positive
- Presumptive negative + Confirmed negative => presumed_or_confirmed_negative
If the above is not methodologically sound, I will correct at a later date, and it will be sufficient to only look at data after 2020/02/28 after which the schema has remained constant.
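The column collapsing described above can be sketched roughly as follows. This is a plain-Python illustration rather than the code in my repo: the helper name `collapse_columns` and the way missing values are spelled ("", "NA", NaN) are assumptions, while the source and target column names are the ones listed above.

```python
# Target column -> source columns that get zero-filled and summed.
COLUMN_MAPPING = {
    "total_approved_for_testing": [
        "Total number of patients approved for 2019-nCoV testing to date",
        "Total number of patients approved for COVID-19 testing to date",
    ],
    "presumed_or_confirmed_positive": ["Presumptive positive", "Confirmed positive"],
    "presumed_or_confirmed_negative": ["Presumptive negative", "Confirmed negative"],
}


def as_number(value):
    """Convert a cell to an int, treating missing values as zero (assumed NA spellings)."""
    if value is None or str(value).strip() in ("", "NA", "NaN", "nan"):
        return 0
    return int(str(value).replace(",", ""))


def collapse_columns(row, mapping=COLUMN_MAPPING):
    """Return a new row with each group of source columns replaced by their sum."""
    used = {src for sources in mapping.values() for src in sources}
    # Keep every column that is not being collapsed.
    out = {key: value for key, value in row.items() if key not in used}
    for target, sources in mapping.items():
        out[target] = sum(as_number(row.get(src)) for src in sources)
    return out
```

For the two "approved for testing" labels only one is ever populated on a given day, so the sum simply picks whichever is present. For the presumptive and confirmed columns the sum really does add two counts together, which is the part I am least sure is methodologically sound.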
So far, I’ve just been able to make a few plots rather manually. Note that unlike the many other data sources on COVID-19, this contains information on the total number tested which is an important detail.
First a summary of all the metrics we have available:
We’ve seen a lot of these types of figures lately. In my experience, people are generally pretty bad at interpreting log-scaled graphs and all the subtleties they encompass. That matter aside, the point I want to make here is that with the exception of resolved cases (gray) all of these metrics appear to be growing exponentially and so we should look at them in context.
Cases Under Investigation
The purple line in the summary figure is the number of cases under investigation. It took me a while to realize that the structure for the status update table is pretty simple and the following formula has generally held true the whole time (cases under review excepted, as in footnotes):
Total tested = Negative + Positive + Resolved + Deceased
Total approved for testing = Total tested + Under investigation
So a case either has a status, or it’s under investigation – so here our numerator should be the number under investigation and our denominator the total number of open cases (i.e. approved for testing), to give the proportion that have yet to receive a result.
You can see that as the number of people getting tested ballooned, there was eventually a problem with how many could be handled; things were the worst right about a week after St. Patrick’s Day when Ontario made a Declaration of Emergency, but lately seem to be getting more under control, as the percent under investigation has dropped from around 30% back down to about half that.
Also for context (not pictured), from March 9th to today, the number approved for testing grew from 2,403 to nearly 50K (49,186).
Finally we can see the inverse relationship between the total tested and under investigation as they add to 100% of the total approved for testing:
Confirmed or Presumed Positive
Now we can take a look at the proportion of total cases tested (that is, those with a status and not under investigation) that were actually positive – so here our numerator is the number of positive cases, and the denominator the number of total tested (positive + negative + resolved + deceased). We can see that the percentage of total which are positive ranges from around 0-3% and has recently increased.
Also interesting to note there is more variation earlier on (prior to Feb 22) as these numbers were very small, and actually decreased because these were presumed positives not confirmed.
Confirmed or Presumed Negative
As the proportion of positive cases increases, the proportion confirmed negative is decreasing. The most recent number is 96.7% – so around 3 in every 100 test results thus far is a positive case of COVID-19 (in aggregate), as we saw above.
It’s unclear to me what is happening with the resolved cases – I didn’t even bother plotting them as a result. I initially thought this was the number of patients in-hospital that have recovered, but I no longer believe that to be the case as the number only ranges from 1-8 and there is very little movement over the weeks.
Finally, we look at the number for deceased. Fortunately, the line in red for deceased cases does not begin until March 17th, although alarmingly in the summary graph it appears to rise quite precipitously. One death is too many, but I promised to put things in context, so remember that this is a log graph and that steep incline represents rising from 1 to 23 deaths from March 17th to 29th, and comparatively the total number of cases tested in the same period rose by over 32K (32,381).
If we look at the percentage deceased out of total tested, we see this increase, but the overall mortality out of those tested is <0.1% (for the final day as of this writing, 23/41,985~=0.055%):
However, that doesn’t make much sense as we actually want to compare deceased to those that actually tested positive, so here our numerator should be deceased and our denominator positive cases. This is that epidemiological metric that has become much more commonly-known as of late, the case fatality rate:
There is a lot of variation here because the numbers are small – and you can see this graph only goes from March 17th when the first fatality occurred – but it ranges from ~0.5% to nearly 2% (max is 1.97%). This seems to be in line with numbers in some other geographies, e.g. for Canada as of this writing, there are 61 fatalities out of 6,258 confirmed cases for a CFR of approximately 1 percent (~0.98%), though obviously this number currently varies widely by geography.
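The only thing that changes between the two mortality figures is the denominator, which a few lines of Python make explicit. The helper name `percent` is just an illustrative choice; the counts are the ones quoted in this post.

```python
def percent(numerator, denominator):
    """Return numerator / denominator as a percentage."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100 * numerator / denominator


# Ontario, final day as of this writing: deceased out of total tested.
ontario_deceased_of_tested = percent(23, 41_985)

# Canada as of this writing: deceased out of confirmed cases, i.e. a crude CFR.
canada_cfr = percent(61, 6_258)

print(f"Deceased / tested (Ontario): {ontario_deceased_of_tested:.3f}%")
print(f"Deceased / confirmed (Canada): {canada_cfr:.2f}%")
```

Same numerator concept, very different answers, which is why it matters so much to be clear about what is being divided by what.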
So what does this mean? I’m not sure, to be honest – like I said, I’m not an epidemiologist. But I think it’s important to put the numbers in context; while the media happily reports absolute numbers of positives and the day-over-day increase in positive cases, I have yet to see metrics like the percentage positive out of tested reported or mentioned, at least for Canada.
Perhaps it’s because they don’t make headlines. Or perhaps it’s because they’re not actually useful – what do the relative proportions actually tell us, from a statistical standpoint? (This I still need to think on).
Other folks smarter than me have also noted that CFR is a complex metric, as there is obviously a delay between infection and mortality, and also that since numbers are small there may be considerable variation (the 95% confidence intervals can be wide). Of course, these metrics are crude, as the total numbers are aggregated over time (i.e. there are more observations at the end than at the beginning) – and another logical next step would be to look at rates of change of these proportions which might be more meaningful.
I think one thing I can definitely make up my mind about is that I don’t want the job of an epidemiologist, even on a good day – the stakes are high and it seems to me there just isn’t enough data we can get accurately, let alone in a situation such as this one where it’s unfolding before us. If we don’t know the denominator, how can we possibly understand how lethal the disease is, or even begin to make a crude estimate of the actual proportion of the population that is infected? Even for H1N1, research has shown in retrospect that CFR estimates over time varied by as much as a factor of 104, depending upon how they were measured. For COVID-19 the numbers vary pretty widely by geo as well – in this article from a couple weeks ago from less than 1% to the very high numbers in Italy (>10% at the time). That’s a lot of uncertainty.
As many people other than me have opined, the problem is that there is just not enough testing being done – while the CBC said this is causing us to underestimate the extent of the virus (and to create another striking red graph of rising confirmed cases), other sources such as The National Post note the silver lining by stating with cautious optimism that perhaps we are overestimating the fatality rate, as countries which have seen success in widespread testing (notably South Korea) have a similar CFR.
Asymptomatic cases aside, the sample of those that test in Ontario is obviously highly biased, as only those who suspect they have COVID are instructed to get tested due to logistical constraints (selection bias is also noted in the CEBM article above). In an ideal world, if those tested were a truly random sample of the subset of the population, I feel like it would be a fairly straightforward application of statistics to estimate a range for the infection rate (one sample z-test of proportion? 95% CIs for a proportion?).
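For what that ideal-world calculation might look like, here is a sketch of a 95% confidence interval for a proportion. I have used the Wilson score interval rather than the plain normal approximation, since it tends to behave better when the proportion is small; that choice, the function name and the sample numbers in the usage note are all my own assumptions for illustration, and none of it applies to the actual Ontario data, which is not a random sample.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion.

    Only meaningful if the n observations are a random sample.
    Returns (lower, upper) as proportions between 0 and 1.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width
```

With a purely hypothetical random sample of 1,000 people of whom 30 test positive, `wilson_interval(30, 1000)` gives a range of roughly 2% to 4% for the infection rate, which shows how wide the uncertainty is even in the well-behaved case.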
But I guess it really depends on what the goals are, in order to determine what number we are really after and need to make informed decisions now.
Besides taking the recommended precautions, all I can do is keep collecting the data and trying to make sense of it as this unfolds. Unfortunately for us, to paraphrase a famous philosopher, life is lived forwards and can only be made sense of backwards.
Again, if you’d like to contribute or use any of what data I have collected thus far, it is available here: https://github.com/mylesmharrison/covid-ontario
UPDATE (2020/03/30): It appears the Ministry of Health has abandoned posting confirmed cases entirely as of today, and has also now changed the format of their status update and is providing more detailed reports on a daily basis ([PDF])