countyname
is State
, or possibly A State
.I added the “deaths by day” to the data collected from the ArcGIS feature server. Note that the deaths included in this data are only Florida residents, and include residents who were diagnosed/isolated outside of Florida.
The health metrics data snapshot is only saved if the data has changed since the last update.
The location and file name format of the PDF reports was changed on 2020-06-04. I noticed on 2020-06-08 and updated my scripts, but the reports from 2020-06-04 through 2020-06-07 were not stored.
Reports on serology and antibody testing were added, presumably at the same time as the change in report location and naming (guess: 2020-06-04).
I updated the PDF extraction scripts to fix the extraction of the lab testing summary tables. In working on this, I noticed that the column headers have changed three times since these reports have been published:
The total tests reported on the dashboard between 2020-05-15 19:00 and 2020-05-16 08:00 EDT were out of range and are clearly erroneous.
The data updates seem to be happening later in the day after 1pm, so I’ve gone back to using the last snapshot of the day for the day’s counts.
The event_date
column was removed on 2020-05-07 from the line list and as of 2020-05-08 the age
column is missing for all cases. It’s likely that the removal of the event_date
is related to recent news coverage about early cases of coronavirus, see issue #17 for more discussion. I may have missed a day of updates (2020-05-07).
The URL for the cases by day timeseries API changed, I may have also missed an update here.
Added Health Metrics data set that was released on 2020-04-28. https://fdoh.maps.arcgis.com/home/item.html?id=492f02b473df4c8f8aa68fedf420c933
Fixed directory names in pdfs/
so that they do not include ":"
(#18).
The evening update on 2020-04-22 may have come later than usual and my snapshot may have missed the last update of the day. I’ve manually adjusted the dash_summary
and arcgis_summary
tables to match.
As of 2020-04-23 in the morning, the line list is currently not available. This is causing the dashboard to show “0 Total Cases”. I adjusted the snapshot scripts so that updates will be stored only when the table is available.
At some point in the last few days, the Cases variable in the Line List table served via ArcGIS changed from a numeric timestamp to a character d/m/Y HH:MM
timestamp.
Similarly Florida resident deaths were previously reported in the variable fl_res_deaths
, which was renamed c_fl_res_deaths
in the county-level summary table served via ArcGIS.
Additional tables that have been added to the dashboard are now also available via the arcgis API. I’ve added the Cases By Zip and Cases by Day tables to the data sets. The raw geojson
for cases by zip are also included in data/geojson
.
I fixed an issue with the line list query to better utilize the arcgis API. I mised one or possibly two updates to this table in since 2020-04-08.
The division of ages in the case count by age range summary on Page 2 of the PDF report has changed from 0-9, 10-19, etc. to 0-4, 5-14, 15-24, etc. in the 2020-03-25-1007 report.
The order of the columns in the county_testing
table changed with the 2020-03-25-1007 report to
county
, pending
, inconclusive
, negative
, positive
, percent
, total
and the heading of the page was changed from All persons tested to All persons with tests reported.
PDF processing is now part of the automatic update. Failures in the extraction process are recorded in the README.md file in the pdf-specific data folder in pdfs/
.
Added a few more strategies for parsing the Line List tables in the PDF repots that fixed a few issues in previous extractions and should make the process more robust.
Added historical PDF reports from 2020-03-16,17,18 that were found via Google
Internal Fixed two issues with the Line List tables extracted from the PDFs. One of the issues caused the extraction to fail for the 2020-03-22 18:28 report, which is now added. The other issue caused cases who are listed in both the line list of cases and the line list of deaths to be double counted. The cases who died are now removed from the line list of cases (matched based on county, age, sex, date of counting) and death is indicated via the died
column.
Added plot showing age distribution by sex among positive cases
For better organization, I’ve moved all of the data files into the same folder: data/
. Files that are not actively updated are stored in old/
.
covid-19-florida-tests.csv
-> data/covid-19-florida_dash_summary.csvcovid-19-florida-cases-county.csv
-> data/old/covid-19-florida_dash_county.csvcovid-19-florida-doh.csv
-> data/old/covid-19-florida-doh.csvpdfs/data
to data/ and the covid-19-florida-pdf-
prefix was added.The county-level confirmed cases table now spans multiple pages (2-3), as of the 2020-03-22 09:51 report.
The community spread table added in the 2020-03-21 10:08:00 EDT report was removed in the next report (2020-03-21 17:31:00 EDT).
A column for people with inconclusive test results was added to the county-level testing table on pages 5-6 of the 2020-03-21 17:31:00 EDT report.
The text for the count of tests awaiting results at BHPL changed from Awaiting BPHL testing to Awaiting BPHL (state public health lab) testing in the 2020-03-21 17:31:00 EDT report, and then back again to the original wording in the next iteration (2020-03-22 09:51:00 EDT).
Internal I added try_safely()
to minimize parsing errors that might break the complete pdf processing pipeline. I also muffled parsing of pdf text tables with quietly()
.
The county deaths label was changed to deaths in the afternoon of between 2020-03-21 11:26:37 EDT and 2020-03-21 16:26:06 EDT.
Internal Use the negatives reported in the PDF and the positive, pending, and deaths reported in the dashboard for the plots on the front page. Also replace change in “resolved” cases with plot of new deaths.
Fixed issue with dashboard html processing so that the values on the main display are used preferentially over values in inner tabs.
A community spread table was added to the 2020-03-21 10:08:00 EDT report and includes cases where travel or contact with a confirmed case cannot be established as the source of infection.
The dashboard now reports county deaths and florida deaths. The former is now preferentially used for the number of deaths in the plot in the README of this repo.