In early March 2020, Alexis Madrigal and Robinson Meyer sought to reveal how little COVID-19 testing had been conducted in the United States. Recognizing a critical need for publicly available, comprehensive data, they founded The COVID Tracking Project, which eventually relied on hundreds of volunteers who entered data manually on a daily and weekly basis for a year.
Throughout the project, we looked to automation as a way to support and supplement the manual work of our volunteers, rather than replace it. This kept us focused on making sure we understood the data and the conclusions we could draw from them, especially as the pandemic evolved.
The Data Landscape
As March 2020 progressed, some states launched COVID-19 data dashboards; others reported data in press conferences—or not at all. It quickly became apparent that daily, close contact with the data was necessary to understand what states were reporting. Madrigal told CNN’s Brian Stelter, “At first, we just thought you could pull numbers off dashboards and that was going to be the main process. As it turned out, we actually needed to do deep research.”
Throughout the pandemic, COVID-19 testing and outcomes data have been a messy patchwork. States frequently changed how, what, and where they reported data. In the absence of clear federal guidelines, states were largely left to figure out on their own how to publish data. Our core data entry team developed extensive processes to deal with this patchwork of state reporting.
When we started the COVID Racial Data Tracker (CRDT), fewer than one fifth of US states reported race and ethnicity testing and outcomes data. As of March 7, 2021, 51 jurisdictions report race and ethnicity data for cases and for deaths. The quantity of data increased over the pandemic, but states continued to struggle with quality.
While the frustrations of state COVID-19 data reporting were obvious to all of our data entry teams, the Long-Term-Care COVID Tracker (LTC) team dealt with particularly agonizing discrepancies in long-term-care data—especially at the facility level. Without clear reporting standards, state long-term-care data are shockingly deficient. States that reported data publicly often did so in wildly different ways, rendering comparisons across states exceptionally challenging.
The COVID-19 data that we collected—testing and outcomes, race and ethnicity, and long-term care—were largely unstandardized and fragmented. Had we set up a fully automated data capture system in March 2020, it would have failed within days.
What We Automated
Over the course of our data collection, we aimed to build increasingly strong automated tooling to validate and augment our manual work.
Zach Lipton, Data Infrastructure Co-Lead, worked initially on urlwatch, a first step towards automation, with the goal of alerting volunteers when data on state pages and dashboards changed. Even reaching that relatively modest goal proved to be a huge undertaking: early on, there were often days when as many as ten states made major changes to their sites or rolled out new dashboards. At this whirlwind pace of updates and reporting changes, even tracking states that published structured, machine-readable data was challenging and time-consuming. Dealing with states that used dashboard platforms like Tableau or Power BI, or which published their data through images on Facebook, proved deeply frustrating.
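The core of a change-alerting tool like urlwatch can be sketched as a fingerprint comparison: hash each page on every fetch and alert when the hash differs from the last one seen. The function names and data shapes below are illustrative, not the project's actual code:

```python
import hashlib

def page_fingerprint(html: str) -> str:
    """Hash the page content so two fetches can be compared cheaply."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def detect_changes(previous: dict, current_pages: dict) -> list:
    """Return the states whose pages changed since the last check.

    `previous` maps state -> last seen fingerprint (updated in place);
    `current_pages` maps state -> freshly fetched page HTML.
    """
    changed = []
    for state, html in current_pages.items():
        fp = page_fingerprint(html)
        if previous.get(state) != fp:
            changed.append(state)
            previous[state] = fp
    return changed
```

In practice a tool like this also needs filters to ignore churn such as timestamps and ad markup, which is much of what made the real monitoring work hard.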
A core tenet of The COVID Tracking Project is data transparency. To show our work, Julia Kodysh, our Data Infrastructure Co-Lead, built an automated screenshots system that captured snapshots of state COVID-19 dashboards and pages. We configured hundreds of screenshots and eventually reached 100% coverage of our 797 core testing and outcomes data points. Screenshots were invaluable in building trust in our data: we amassed a trove of snapshots in our public archive that allowed us—and our data users—to verify our data against an easily accessible record of each state’s COVID-19 data pages.
Rebma, an infrastructure engineer, built an automated data fetcher that served to verify the testing and outcomes data that our volunteers entered during data entry shifts. This automated checking was tremendously helpful for the volunteers who entered and double-checked roughly 800 data points every day.
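The key design point of a fetcher like this is that it verifies rather than replaces human entry: it flags disagreements for review instead of overwriting them. A minimal sketch of that cross-check, with hypothetical field names and no connection to the project's real codebase:

```python
def cross_check(entered: dict, fetched: dict, tolerance: int = 0) -> list:
    """Compare manually entered values against automatically fetched ones.

    Returns (field, entered_value, fetched_value) tuples for every
    mismatch, so a human can decide which number is right.
    """
    mismatches = []
    for field, entered_value in entered.items():
        fetched_value = fetched.get(field)
        if fetched_value is None:
            continue  # fetcher could not read this field; humans decide alone
        if abs(entered_value - fetched_value) > tolerance:
            mismatches.append((field, entered_value, fetched_value))
    return mismatches
```

For example, `cross_check({"positives": 1200, "deaths": 14}, {"positives": 1250, "deaths": 14})` would surface only the `positives` disagreement.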
Carol Brandmaier-Monahan and Pat Kelly maintained a similar data fetcher system for CRDT race and ethnicity data that would pull the often-messy numbers into our entry spreadsheets for humans to validate. The system saved volunteers hours of entering data and allowed them to focus on its correctness.
Despite the hurdles of long-term-care data, Lipton built tools to automate parts of LTC data collection: facility-level data scrapers for Nevada and West Virginia’s Power BI dashboards, for Kansas’ Tableau dashboard, and for Utah’s ArcGIS dashboard. Before these scrapers existed, collecting data from each of these states took the LTC team hours. Additionally, Kodysh led our Data Infrastructure team in building an extensive data processing API for long-term-care data to address deficiencies in state reporting—for example, calculating true cumulative case and death counts in Kentucky to correct for a data reset that saw all facilities drop to zero reported cases and deaths at the beginning of 2021.
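A correction like the Kentucky one can be sketched as restoring monotonicity to a cumulative series: whenever the reported total drops, carry the lost amount forward as an offset. This is a simplified illustration of the general technique, not the project's actual processing logic:

```python
def correct_reset(series: list) -> list:
    """Rebuild a true cumulative series from one that was reset.

    Cumulative counts should never decrease; whenever the reported value
    drops (e.g. a facility reset to zero), the shortfall is added to a
    running offset so the corrected series stays monotonic.
    """
    corrected = []
    offset = 0
    prev = 0
    for value in series:
        if value < prev:              # a drop signals a reset or purge
            offset += prev - value    # carry the lost total forward
        corrected.append(value + offset)
        prev = value
    return corrected
```

Given reported facility totals of `[90, 100, 0, 5, 12]`, this yields `[90, 100, 100, 105, 112]`—the pre-reset total is preserved and post-reset increments accumulate on top of it.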
Automation helped immensely to capture constantly changing data like time series. States often revised and updated time series as lagged data entered the reporting pipelines; for example, a state’s testing number for a particular day might be revised once additional tests from that day were reported to the state health department. Using scripts to fetch time series data daily, we were able to maintain an updated picture of this ever-changing data that we used for checks and research.
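Spotting revisions between two daily snapshots reduces to comparing values date by date. A minimal sketch, assuming each snapshot is a simple date-to-value mapping (the real pipelines were, of course, more involved):

```python
def find_revisions(old_snapshot: dict, new_snapshot: dict) -> dict:
    """Compare two daily fetches of a state's time series.

    Returns {date: (old_value, new_value)} for every date whose
    previously reported value was revised in the newer fetch.
    """
    return {
        date: (old_snapshot[date], value)
        for date, value in new_snapshot.items()
        if date in old_snapshot and old_snapshot[date] != value
    }
```

For example, if Monday's fetch reported 500 tests for March 1 and Tuesday's fetch reported 530 for the same date, the function returns `{"2021-03-01": (500, 530)}`, making the backfilled reporting visible.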
We automated more as the pandemic went on and as state data became more stable, but we never fully automated any data point in our testing and outcomes, CRDT, or LTC datasets.
The Benefits of Human Involvement
Not automating has an obvious cost—human labor. We estimate that our volunteers spent more than 20,000 hours doing data entry alone. But there were clear benefits, some of which weren’t easy to foresee as we built out processes at the beginning of the project.
Seeking out and manually entering each data point gave us a detailed understanding of the data that we would not have been able to develop had we automated data collection. We knew when new data points were added and came across caveats and notes posted on state data pages. When metrics changed abruptly, we sought out explanations that states sometimes posted to explain issues like data dumps and reporting lag. The perils of failing to do so are evident: for example, on March 18, 2021, CNN reported dramatic COVID-19 case spikes in Alabama and Delaware, citing data that had been collected automatically—and missing notes that both states published on their dashboards clarifying that the increases resulted from backlogged cases. Additionally, we learned what was normal and what was abnormal on a state-by-state basis, enabling us to make informed decisions when handling reporting anomalies.
Another benefit: human eyes on—and heavy use of—data pages and dashboards enabled us to give informed feedback to states, ranging from comments about website accessibility to requesting that states publish more data to correcting errors. For example, on November 5, 2020, Hawaii was incorrectly summing numbers after a transition to a newer data processing method—reporting 16,096 cumulative hospitalizations on one dashboard and 1,125 on another. We noticed the discrepancy between the two dashboards and submitted feedback to a contact at the state Emergency Management Agency, after which Hawaii corrected the mistake—1,125 was the correct total—and explained what had happened.
Visiting state data pages each day meant that we discovered new metrics soon after they were first posted. Had we relied on scripts to pull data, we would have taken much longer to discover newly reported data.
A project like ours requires experience in journalism, computing, reporting, and science communication, to name but a few disciplines. More basically, though, we needed huge numbers of volunteers with some familiarity with spreadsheets to enter data. As such, we ended up recruiting hundreds of volunteers into data entry. Working data entry shifts gave our volunteers a taste of the work and of our unique culture, offering an excellent introduction to the varied opportunities to contribute within the project. After starting in data entry, many volunteers moved on to roles in editorial, research, data science, and other teams.
Manual data entry provided a ritual for volunteers. We collected long-term-care data weekly and race and ethnicity data twice weekly. And day after day, through weekends, holidays, and losses both personal and global, we collected testing and outcomes data for 56 jurisdictions. Our data entry teams formed the core of our community: data collection is rote work, but when done as part of a supportive group, it can offer a sense of purpose and community.
What We Learned
The COVID Tracking Project operated in a data landscape without consistent reporting standards. We worked with constantly changing dashboards, dashboards designed to be hard to scrape automatically, and even changing data definitions. Building automated tooling facilitated our manual data collection and was crucial to our ability to verify data. Ultimately, though, human scrutiny was needed to develop a deep understanding of the data. Perhaps most importantly, collecting data manually allowed us to build a culture of curiosity and care about the data in a way that we could not have done if we were a primarily automated project.
Additional contributions from Julia Kodysh, Michal Mart, and Theo Michel.
Source: https://covidtracking.com/analysis-updates/why-we-didnt-automate-our-data-collection