Times Insider explains who we are and what we do, and gives a behind-the-scenes look at how our journalism comes together.
As of this morning, programs written by New York Times developers have made more than 10 million requests for Covid-19 data from websites around the world. The data we collect are daily snapshots of the ebb and flow of the virus, including for every US state and thousands of US counties, cities, and zip codes.
You may have seen snippets of this data in the daily maps and charts we publish at The Times. Together, these pages, which have involved more than 100 journalists and engineers from across the company, are the most-viewed collection in the history of nytimes.com and an integral part of the package of Covid coverage that won The Times the 2021 Pulitzer Prize for public service.
The Times’ coronavirus tracking project was one of several efforts that helped fill the gap in public understanding of the pandemic created by the lack of a coordinated government response. The Johns Hopkins University Coronavirus Resource Center collected both national and international case data. And The Atlantic’s Covid Tracking Project put together an army of volunteers to collect U.S. state data in addition to testing, demographics, and health facility data.
At The Times, we started with a single spreadsheet.
In late January 2020, Monica Davey, an editor on the National desk, asked Mitch Smith, a Chicago-based correspondent, to start gathering information on every single U.S. Covid-19 case. One line per case, meticulously reported from public announcements and entered by hand, with details such as age, location, gender, and condition.
By mid-March, the virus’s explosive growth proved too much for our workflow. The spreadsheet grew so large it stopped responding, and the reporters didn’t have enough time to manually report and enter data from the ever-growing list of U.S. states and counties that we had to track.
At this point, many health officials around the country began setting up Covid-19 reporting efforts and websites to inform their residents about the local spread. The federal government, meanwhile, struggled early on to provide a single, reliable national dataset.
The local data available was literally and figuratively all over the map. Formatting and methodology varied widely from place to place.
Within The Times, a newsroom-based group of software engineers was quickly tasked with developing tools to expand the data collection work as much as possible. The two of us – Tiff is a newsroom developer and Josh is a graphics editor – would end up shaping this growing team.
On March 16, the core application was largely working, but we needed help finding many more sources. To tackle this colossal project, we recruited developers from across the company, many of whom had no newsroom experience, to temporarily step in to write scrapers.
June 27, 2021, 8:40 p.m. ET
By the end of April, we were programmatically collecting numbers from all 50 states and nearly 200 counties. But the pandemic and our database both seemed to grow exponentially.
Also, some notable sites changed multiple times in just a few weeks, which meant we had to keep rewriting our code. Our newsroom engineers adapted by tweaking our custom tools – while they were in daily use.
Up to 50 people outside of the scraping team were actively involved in the day-to-day management and review of the data we collected. Some data is still entered by hand, and everything is checked manually by reporters and researchers, a seven-day-a-week operation. Reporting rigor and subject-matter fluency were integral to all of our roles, from reporters to data reviewers to engineers.
In addition to posting data on The Times website, we made our dataset publicly available on GitHub in late March 2020.
As vaccinations curb the virus’s toll across the country – a total of 33.5 million cases have been reported – a number of health departments and other sources are updating their data less often. Conversely, the federal Centers for Disease Control and Prevention has expanded its reporting to include comprehensive figures that were only partially available in 2020.
All of this means some of our own custom data collection can be shut down. Since April 2021, the number of our programmatic sources has decreased by nearly 44 percent.
Our goal is to get down to around 100 active scrapers by late summer or early fall, mainly to track potential hot spots.
The dream, of course, is to wind down our efforts as the threat from the virus subsides significantly.
A version of this article was originally published on NYT Open, the New York Times blog about designing and building products for news.