Analysis and Visualization of the Ontario COVID-19 Data (Part 2)


In Part 1, I looked at the trends and did some visual analysis of the COVID-19 case status data for Ontario. This also required some rather involved data acquisition, as the information on COVID for Ontario was only available on a daily basis on the Ministry of Health’s official COVID page.

Since then, the data has finally been made open and is now available on the Ontario Open Data page:

Status of cases in Ontario:
Confirmed Positive cases:

The second of these is the data set with important geographic and demographic information that I lamented I could not locate previously.

So here in Part 2, we will dive into the confirmed cases data to see if there are any simple insights we can uncover about those positive cases of COVID in Ontario.

The same disclaimers from Part 1, and then some, also hold for this post (I am not an epidemiologist, interpretations of the data and opinions are my own, may contain errors, I am not associated with the Ministry of Health, etc.).


The data is on confirmed positive cases of COVID-19 in Ontario, taken from the second link listed above. Unfortunately no data dictionary is provided, but I’ve done my best to interpret the included fields and most are fairly straightforward:

ROW_ID – an index for the row (case) id
ACCURATE_EPISODE_DATE – a date, presumably of when the case was confirmed positive. The ‘accurate episode’ makes me wonder if an adjustment has been applied here in some way.
Age_Group – demographic – age bucket in decades, ranging from 20’s-90’s, + those less than 20 and ‘unknown’
CLIENT_GENDER – the gender of the case
CASE_ACQUISITIONINFO – How the case was presumably contracted (source of transmission)
OUTCOME1 – probably the most important field – the case’s status. Either ‘Resolved’, ‘Not Resolved’, or ‘Fatal’ – below I also refer to ‘Not Resolved’ as ‘Active’ and ‘Fatal’ as ‘Deceased’
Reporting_PHU, Reporting_PHU_Address, Reporting_PHU_City, Reporting_PHU_Postal_Code, Reporting_PHU_Website, Reporting_PHU_Latitude, Reporting_PHU_Longitude  – geographic information on the public health unit (PHU) reporting. For now, I’ve only used city and lat/lon

To get some insights fast, these are quick-and-dirty visuals whipped up in Tableau – pardon my lack of axis labelling and the like, but the text around the visuals should provide the proper context. Also for reference, the total number of cases in the data set is 3,630, current as of being pulled on 2020/04/05.


Status and Acquisition
Looking at the overall status, we can see nearly 2/3 are active, about 1/3 are resolved, and the remainder (~2.6%) are fatal (I’ve redundantly color-coded these to be consistent with later visuals):

That 2.6% is the number deceased over the total number of known cases, i.e. the Case Fatality Ratio (CFR), which you may have seen a lot in the news and in my previous post, and which we’ll come back to shortly.
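As a rough sketch of that calculation (the visuals in this post were made in Tableau, so this is not the method actually used), the status shares and the CFR can be counted straight from the OUTCOME1 column. The sample rows below are made up for illustration; only the column name and the three outcome labels come from the data set.

```python
import csv
import io
from collections import Counter

# Illustrative rows only: the real file has one row per confirmed case.
# The OUTCOME1 column name and its labels are taken from the field list above.
SAMPLE_CSV = """ROW_ID,OUTCOME1
1,Resolved
2,Not Resolved
3,Not Resolved
4,Fatal
5,Not Resolved
"""

def outcome_shares(rows):
    """Return {outcome: share of all cases} for an iterable of row dicts."""
    counts = Counter(row["OUTCOME1"] for row in rows)
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}

shares = outcome_shares(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cfr = shares.get("Fatal", 0.0)  # deceased / all known cases
print(f"CFR in the toy sample: {cfr:.1%}")
```

Pointing `csv.DictReader` at the downloaded file instead of the inline sample should give the real breakdown, assuming the column is spelled the same way in the file.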

Looking at acquisition, it appears that most are unknown (1,757 or nearly half) and the remainder are divided somewhat evenly in descending order (Travel-Related, Neither, Contact of a Confirmed Case).

I’m curious to know what ‘neither’ here corresponds to – I assume people that have COVID that haven’t travelled nor said they’ve been in contact with someone who is a confirmed positive – which, in my mind at least, raises more questions about the number of cases out there in the general population that are asymptomatic.

First let’s take a look at the breakdown of positive cases by city:

You can see that Toronto makes up the largest number of cases at over a quarter, and that cases in the GTA make up nearly half (~46.5%, or 1,689 cases). After that, the largest number of confirmed cases is in Ottawa, with 309 (~8.5%).

If you prefer the above in a map view we can also visualize using latitude and longitude for the PHUs which are included:

Again, you can see cases are focused in Central Ontario / GTA and Ottawa. Very small numbers are present in smaller cities, which I would guess has more to do with the amount of testing being done / available.

For the demographic breakdown, let’s first look at gender:

This is one of the few cases where I would use a proportional stacked area graph because there are really only two categories to compare across. These numbers are pretty close to 50/50. There was one ‘Unknown’ case, which is still active, as well as one transgender case, which is resolved.

Next, let’s look at age:

Hmm, looks kind of skewed normal to me – and actually, maybe not that dissimilar from the actual age distribution of Ontario for 2020 according to Statistics Canada?

I wouldn’t confidently say that contracting COVID discriminates by age or gender.

Age and Status and Case Fatality Ratio (CFR)
Now to look at the case statuses by age bucket. The colour coding here is consistent with our original first graph, and is redundant given the column headings, but I find does make this much easier to interpret:

You can see that by status, proportionally speaking, there are slightly more active cases on the tail end vs. resolved. Most notable in this graph, in red on the right, is that the largest proportion of fatal cases is among those who are very old (which is already a well-known fact). More than 80% of the deceased are in their 70s-90s.

I’ve also made sure to label the columns by the number of observations (n), as here we are looking at percentages but the number of data points in each column is quite different.

If we look at the proportion of cases which are fatal (CFR) this pattern is even more striking:

Nearly a quarter of cases for those in their 90s were fatal, whereas for those under 70 the CFR is less than 2%, and for those under 60 it is ~0.5%.
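For anyone wanting to reproduce a per-bucket CFR outside of Tableau, here is a minimal sketch. The (age group, outcome) pairs are invented for illustration, and the age labels are guesses at how the Age_Group values might be spelled rather than values copied from the file.

```python
from collections import defaultdict

# Illustrative (age group, outcome) pairs; the labels are assumptions,
# not values copied from the Ontario file.
cases = [
    ("20s", "Resolved"), ("20s", "Not Resolved"),
    ("80s", "Fatal"), ("80s", "Not Resolved"),
    ("80s", "Resolved"), ("80s", "Fatal"),
]

def cfr_by_group(pairs):
    """Return {group: (n_cases, n_fatal, cfr)} for (group, outcome) pairs."""
    totals = defaultdict(int)
    fatal = defaultdict(int)
    for group, outcome in pairs:
        totals[group] += 1
        if outcome == "Fatal":
            fatal[group] += 1
    return {g: (n, fatal[g], fatal[g] / n) for g, n in totals.items()}

by_age = cfr_by_group(cases)
for group, (n, n_fatal, ratio) in sorted(by_age.items()):
    print(f"{group}: n={n}, fatal={n_fatal}, CFR={ratio:.1%}")
```

Keeping n alongside each ratio is deliberate, since a percentage on its own hides how few cases sit behind it.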

We should bear in mind, however, that (thankfully) the counts of deceased are small, so there will be a high amount of variation when we look at ratios/percentages. We can see this by looking at the number deceased (n) vs. the CFR:

You can see that age buckets with fewer total cases are generally older and have higher CFRs, so there is going to be more variability here – but you can also see that for those <20, the rate is 0 with a small sample size.

I don’t think there’s any doubt around the fact that those who are older are more at risk if they contract COVID – I just wanted to make this graph to highlight the importance of thinking about sample size when looking at things like percentages. This is sometimes mentioned in different posts I’ve seen (very infrequently in the news), but more often than not overlooked, and is only one of the reasons why CFR can vary so widely (e.g. by geography).
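One way to make the sample-size point concrete is to put an interval around each ratio. The sketch below uses the Wilson score interval, which is my choice for illustration and not something used elsewhere in this post; the counts are hypothetical.

```python
import math

def wilson_interval(deaths, cases, z=1.96):
    """Approximate 95% Wilson score interval for the proportion deaths / cases."""
    if cases == 0:
        raise ValueError("need at least one case")
    p = deaths / cases
    denom = 1 + z**2 / cases
    centre = (p + z**2 / (2 * cases)) / denom
    half = z * math.sqrt(p * (1 - p) / cases + z**2 / (4 * cases**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical counts: the same 25% observed ratio at two sample sizes,
# plus a bucket with zero deaths out of a small number of cases.
small = wilson_interval(5, 20)
large = wilson_interval(500, 2000)
zero = wilson_interval(0, 50)

for label, (lo, hi) in [("n=20", small), ("n=2000", large), ("0 of 50", zero)]:
    print(f"{label}: {lo:.1%} to {hi:.1%}")
```

The interval for the small bucket is much wider than for the large one, and an observed rate of 0 in a small bucket still leaves room for a non-zero underlying rate.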

Something else that is missing here is whether any of the cases had any underlying conditions.

Cases and CFR by Day
If we look at the daily trend, we can get some interesting insight about CFR:

You can see the ratio varies quite widely, from 0% on days on which there were no new deceased to as high as over 14% near the beginning. In aggregate, the variation steadies over time as the number of cases (n) becomes larger, and you can see that the aggregate CFR for Ontario across all geographies and age buckets has stabilized somewhere around 2.6%.
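The difference between the per-day ratio and the running aggregate can be sketched as follows. I am assuming here that the per-day CFR is the number of fatal cases over all cases sharing the same episode date; the daily counts are hypothetical and only meant to show the shape of the effect.

```python
from itertools import accumulate

# Hypothetical (new cases, new deaths) per day, for illustration only.
daily = [(7, 1), (20, 0), (60, 1), (150, 4), (400, 10)]

def daily_and_cumulative_cfr(daily_counts):
    """Return two lists: the per-day CFR and the running aggregate CFR."""
    cases = [c for c, _ in daily_counts]
    deaths = [d for _, d in daily_counts]
    per_day = [d / c if c else 0.0 for c, d in daily_counts]
    running = [
        d / c if c else 0.0
        for c, d in zip(accumulate(cases), accumulate(deaths))
    ]
    return per_day, running

per_day, running = daily_and_cumulative_cfr(daily)
for day, (p, r) in enumerate(zip(per_day, running), start=1):
    print(f"day {day}: daily {p:.1%}, aggregate {r:.1%}")
```

The per-day values jump around while counts are small, whereas the aggregate settles as the cumulative n grows.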

You can see that the first ‘deceased’ case is on the 1st of March, whereas in the case status data, the first reported deceased is not until just before the lockdown for Ontario began, on March 17th.

This makes me think that the dates in the confirmed cases data are adjusted, whereas those in the case statuses are not. And, because there is a delay between contracting the disease and succumbing to it, you will get fairly drastically different case fatality ratios depending on how you calculate them.

I think there was a considerable change in the reporting, as the case status data now seems to be in line with what we see here in the confirmed positives, whereas the previous numbers looked quite different (as in my previous analysis).

For example, the CFR for March 28th from the case status data is 1.66% (19 deceased over 1,144 cases), whereas for confirmed cases it is 2.7% (78 deceased over 2,879).
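The arithmetic behind those two figures is easy to check. The function name below is just a label for this sketch; the counts are the ones quoted above.

```python
def cfr(deceased, cases):
    """Case fatality ratio: deceased divided by known cases."""
    return deceased / cases

status_cfr = cfr(19, 1144)     # case status data, March 28th
confirmed_cfr = cfr(78, 2879)  # confirmed cases data, same date

print(f"case status: {status_cfr:.2%}, confirmed cases: {confirmed_cfr:.2%}")
print(f"difference: {confirmed_cfr - status_cfr:.2%} (percentage points)")
```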

There is a fairly large discrepancy between the two, so something different must be being reported here; my hope is that it is simply that the dates in the confirmed cases data are adjusted. Reassuringly, the two data sets do seem to be in line as of the most recent figures.


To summarize, for Ontario nearly 2/3rds of the positive cases remain active, a 3rd resolved, and the aggregate case fatality rate (proportion of positives deceased) is ~2.6%.

At the moment, the largest number of cases tested positive remains in the Greater Toronto Area, and there doesn’t seem to be a strong demographic pattern in those who have contracted COVID.

We have seen here again, as has already been well established, how dangerous the disease is to the elderly, with case fatality rates by demographic that fall within ranges similar to those seen elsewhere. What will be of interest is seeing whether these numbers evolve with time and how COVID-19 is changing within Ontario – which will likely come back to using the case status data again.

I am continuing to archive (and to a degree, consolidate) the data from the Ministry website – it can be found on my GitHub here:

For part 3, I will continue to work with the case status updates, as there is more data now being made available on a daily basis, including hospitalizations.

2 thoughts on “Analysis and Visualization of the Ontario COVID-19 Data (Part 2)”

  1. Hi,

    In your post from early April you provide guesses at the meaning of each field in the Ontario Covid data set you grabbed from the link below. It looks like they’ve posted a data dictionary now so we no longer need to guess. The date field was the troublesome one. It appears to be approximate onset.

    Personally, I was using the table to plot moving averages for Ottawa because Ottawa Public Health’s own reporting just shows daily numbers which jump around.

    1. Thanks Brian – I suspected as much. This would explain the discrepancy between looking at reported date and approximate onset date (which appears to be a back-dated estimate) on a per-case basis.
