# Dengue Forecasting Project

Welcome to the Dengue Forecasting Project

Several departments in the U.S. Federal Government (Department of Health and Human Services, Department of Defense, Department of Commerce, and the Department of Homeland Security) have joined together, with the support of the Pandemic Prediction and Forecasting Science and Technology Interagency Working Group under the National Science and Technology Council, to design an infectious disease forecasting project with the aim of galvanizing efforts to predict epidemics of dengue.

This interagency project will assess forecasts using historical data from Iquitos, Peru and San Juan, Puerto Rico. On this webpage you can find the forecast targets, dengue data, environmental data, submission procedure, evaluation criteria, and a detailed description of the project. Information about dengue is available from the Centers for Disease Control and Prevention. Any updates will be posted here and on the NOAA Dengue Forecasting Page, where these materials are also available. For submissions or questions about the project, contact predict @ cdc.gov.

Timeline
• Training data release for model development: June 5, 2015
• Model description and training forecasts due: August 12, 2015
• Testing data release for model evaluation: August 19, 2015
• Testing forecasts due: September 2, 2015
• Model and forecast evaluation workshop: date to be decided, mid to late September, 2015

Project Description

The project description can be downloaded as a PDF or Word document.

Timing of peak incidence

DEFINITION

The week when the highest incidence of dengue occurs during the transmission season.

Forecast the week when weekly dengue incidence is highest, specifically the variable total_cases in the dataset. The week is expressed as the week of the transmission season (season_week in the dataset), that is, the number of weeks after the historically lowest incidence week of the year for each location.

MOTIVATION

A forecast of the week of peak incidence can help health officials target prevention messages and activities. It could also help hospital personnel make appropriate decisions about resource allocation (e.g. staffing) to ensure optimal patient care.

Maximum weekly incidence

DEFINITION

The number of dengue cases reported during the week when incidence peaks.

Forecast the peak weekly incidence for the transmission season. This is the maximum value of total_cases for each season in the dataset.

MOTIVATION

A forecast of the peak incidence can help health officials anticipate the burden of disease. If transmission is intense, more resources may be allocated to help control the epidemic. It can also help hospital personnel prepare for surges in visits to ensure that all patients receive optimal care.

Total number of dengue cases in a transmission season

DEFINITION

The total number of dengue cases reported in the transmission season.

Forecast the total number of dengue cases reported in a transmission season. This is the sum of total_cases for each season in the dataset.

MOTIVATION

A forecast for the total number of cases in a transmission season can help with long-term resource allocation. If many cases are expected throughout the season, prevention and control resources may be allocated differently than if a large transient surge is expected.
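The three targets defined above can be derived from the weekly case counts with a few lines of code. The following is a minimal sketch in Python, not the project's official procedure. It assumes each record carries the documented fields season_week and total_cases plus a season label (the field name "season" is an assumption), and it breaks ties by taking the earliest week with the maximum count, which is also an assumption.

```python
from collections import defaultdict


def season_targets(rows):
    """Return peak week, peak incidence, and season total for each season.

    rows: iterable of dicts with keys "season" (assumed name),
    "season_week", and "total_cases".
    """
    by_season = defaultdict(list)
    for row in rows:
        by_season[row["season"]].append(
            (int(row["season_week"]), int(row["total_cases"]))
        )

    targets = {}
    for season, weeks in by_season.items():
        weeks.sort()  # order by season_week so ties resolve to the earliest week
        peak_week, peak_cases = max(weeks, key=lambda wc: wc[1])
        targets[season] = {
            "peak_week": peak_week,            # timing of peak incidence
            "peak_incidence": peak_cases,      # maximum weekly incidence
            "season_total": sum(c for _, c in weeks),  # total cases in season
        }
    return targets


# Toy records for illustration only; these are not project data.
example_rows = [
    {"season": "2005/2006", "season_week": 1, "total_cases": 4},
    {"season": "2005/2006", "season_week": 2, "total_cases": 9},
    {"season": "2005/2006", "season_week": 3, "total_cases": 9},
    {"season": "2005/2006", "season_week": 4, "total_cases": 5},
    {"season": "2006/2007", "season_week": 1, "total_cases": 2},
    {"season": "2006/2007", "season_week": 2, "total_cases": 7},
    {"season": "2006/2007", "season_week": 3, "total_cases": 3},
]
example_targets = season_targets(example_rows)
```

In practice the rows could come from `csv.DictReader` applied to the training file, provided the column names match.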

Data for developing and evaluating predictions

This page contains links to dengue data for San Juan, Puerto Rico and Iquitos, Peru. Initial data were available through the 2008/2009 season for each location, for model training and for forecasting the 2005/2006 to 2008/2009 seasons. The complete dataset, including data through the 2012/2013 season, was provided after the training phase for testing fully developed models on the 2009/2010 to 2012/2013 seasons.

San Juan, Puerto Rico - 1990/1991 to 2012/2013

This dataset includes the number of laboratory-positive dengue cases by week for the San Juan-Carolina-Caguas Metropolitan Statistical Area. The data covers the 1990/1991 to 2008/2009 dengue seasons and includes serotype-specific incidence. Complete details can be found in the metadata.

Training data - 1990/1991 to 2008/2009

Testing data - 1990/1991 to 2012/2013

Iquitos, Peru - 2000/2001 to 2012/2013

This dataset includes the number of laboratory-positive dengue cases by week for Iquitos, Peru. The data covers the 2000/2001 to 2008/2009 dengue seasons and includes serotype-specific incidence. Complete details can be found in the metadata.

Training data - 2000/2001 to 2008/2009

Testing data - 2000/2001 to 2012/2013

Environmental data

NOAA data sources are compiled for easy comparison with dengue health data. These environmental data are from Iquitos, Peru and San Juan, Puerto Rico to correspond with the available dengue data. All data are provided in csv format for easy access. All data are maintained and quality controlled by the programs that manage these data sources. Please be aware that the data may still contain quality issues, and any user should inspect them before use.

Environmental data provided for this study are from a variety of sources (ground observations, remote sensing, and reanalysis). Each source has different limitations and quality issues. Users of these data should be careful to utilize data that provide the most confidence for the actual representation of the surrounding conditions. Ground observations are generally an optimal representation of the actual local conditions. Remotely sensed observations are generally an excellent observation of precipitation and vegetation conditions for a location. Reanalysis data are a good representation of the conditions of a given area, especially when other data sources are not available.

Users should also be aware that these environmental data are at varying spatial scales. Station data are point-based estimates of the local climate conditions, whereas the reanalysis and remotely sensed data are reported for grid cells. Please read the following descriptions to learn more about the differences.

Below is a description of the environmental data sources that are provided for analysis with dengue data.

Station data - Temperature and precipitation - NOAA's GHCN daily climate data

The GHCN stations with daily temperature and precipitation data are available for both locations.

Additional data access and full description of data can be found here: https://www.ncdc.noaa.gov/oa/climate/ghcn-daily/

Individual stations that are inside or near the city are listed here:

• San Juan, Puerto Rico: RQW00011641; Lat = 18.4325; Long = -66.0108; Elevation = 2.7; Start = 1956; End = 2015
• Iquitos, Peru: PE000084377; Lat = -3.783; Long = -73.3; Elevation = 126; Start = 1973; End = 2015

Temperature values are in Celsius; Precipitation values are in mm.

Values include Maximum Temperature, Minimum Temperature, Daily Average Temperature, Diurnal Temperature Range, and Daily Precipitation.

Be aware that some stations may have missing days and/or missing values. Missing values are identified as -9999. Missing days are not identified in the record.

Station observations for San Juan are complete and have few missing values. However, there are multiple missing station observations for Iquitos and users should refer to other data sources provided.
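Because missing values are coded as -9999 and missing days are simply absent from the record, both kinds of gap need to be made explicit before the station data are aggregated or lagged. The sketch below shows one way to do this in Python. The function name and the input layout (a mapping from date to value) are hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta

MISSING = -9999  # sentinel for missing values, as stated in the data description


def fill_daily_gaps(records):
    """Return (date, value) pairs for every day between the first and last date.

    records: dict mapping datetime.date -> observed value (hypothetical layout).
    Days absent from the record and values equal to the -9999 sentinel are
    both returned as None so that later steps cannot mistake them for data.
    """
    if not records:
        return []
    day, last = min(records), max(records)
    filled = []
    while day <= last:
        value = records.get(day)  # None when the day is absent from the record
        if value == MISSING:
            value = None
        filled.append((day, value))
        day += timedelta(days=1)
    return filled


# Toy maximum temperatures in Celsius, for illustration only.
station_tmax = {
    date(2008, 1, 1): 29.4,
    date(2008, 1, 2): -9999,   # flagged missing value
    date(2008, 1, 4): 30.1,    # 2008-01-03 is absent from the record
}
filled_tmax = fill_daily_gaps(station_tmax)
```

How the resulting gaps are treated (dropped, interpolated, or replaced with satellite or reanalysis values) is a modeling decision left to each team.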

Satellite precipitation - Precipitation - NOAA's CDR PERSIANN Precipitation Product

PERSIANN is a global climatological data record of precipitation from remote sensing information using an artificial neural network. These data are available for each city on a daily basis from 1983-present.

The resolution of this data product is 0.25x0.25 degree.

Additional data access and full description of data can be found here: http://www.ncdc.noaa.gov/cdr/operationalcdrs.html

Precipitation data are from the grid surrounding the station located in the city.

Precipitation values are reported as daily sums in mm.

Missing values are listed as -9999.

Reanalysis - Temperature and precipitation - NOAA's NCEP Climate Forecast System Reanalysis

The Climate Forecast System Reanalysis (CFSR) is a global, high-resolution, coupled atmosphere-ocean-land surface-sea ice system designed to provide the best estimate of the state of these coupled domains over the period of record.

Data are available from 1979-present.

Additional data access and full description of data can be found here: http://rda.ucar.edu/datasets/ds093.0/#metadata/detailed.html?_do=y

CFSR provides a variety of data that are not easily accessible from other data sources. This includes relative humidity, specific humidity and dew point.

Temperature data values are available in Kelvin.

Resolution of the grid is 0.5 degree.

Satellite vegetation - Normalized difference vegetation index (NDVI) - NOAA's CDR Normalized Difference Vegetation Index

NDVI CDR is a global climatological data record of vegetation. These data are available for each city on a weekly basis from 1981-present.

The resolution of the product is 0.05x0.05 degree.

Additional data access and full description of data can be found here: http://www.ncdc.noaa.gov/cdr/operationalcdrs.html

Four pixels closest to the city centroid are provided for evaluation of vegetation change.

30-year climatologies were also provided for these sites.

NDVI data are provided without quality flagged data removed. Quality flagged data are provided at the CDR webpage.

Population data

This page contains data for the population sizes for San Juan, Puerto Rico and Iquitos, Peru.

San Juan, Puerto Rico - 1990, 1999 to 2014

This dataset includes estimates for the population of the San Juan-Carolina-Caguas Metropolitan Statistical Area for 1990 and 1999-2014 from the U.S. Census Bureau: 1990, 1999, 2000-2009, and 2010-2014.

Iquitos, Peru - 2000 to 2014

This dataset includes estimates for the total population of the four districts in the metropolitan area of Iquitos (Iquitos, Punchana, Belen and San Juan Bautista) from the National Statistics Institute of Peru: 2000-2014.

How to submit forecasts

Forecasts will be made in two stages. First, the training data should be used by each team to develop and select the optimal model for each prediction target and location: San Juan, Puerto Rico and Iquitos, Peru. Once this has been accomplished, the team should write a brief description of the model and data used. If different models are used for different targets or locations, each model should be described. The team should also prepare forecasts for the years 2005-2009 using the selected model(s). For each of these four transmission seasons, forecasts should be made every 4 weeks (weeks 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48) for each target. Each forecast should include a point estimate and a probability distribution. Note that forecasts should be made for peak incidence even after peak incidence has occurred. These forecasts reflect the probability that peak incidence has already occurred (e.g. late season forecasts should be non-zero if there is some chance of a second, higher peak).

One “csv” file should be prepared for each location and each target using the supplied templates. The initial model description and forecasts should be submitted to predict @ cdc.gov by August 12, 2015. These forecasts will be used to verify that the format is correct and to provide metrics on fit to the training data.

All teams with verified submissions by August 12, 2015, will receive the testing data by email in the same format as the training data on August 19, 2015. They will have two weeks to submit forecasts for the 2009-2013 testing period using the already selected model. These forecasts should use exactly the same model and same format as the first submission and must be submitted to predict @ cdc.gov by September 2, 2015.

IMPORTANT NOTE: Much of the data for 2009-2013 is currently accessible to researchers; it is therefore incumbent upon the researcher NOT to use these data for model development or evaluation. The data are supplied only for “future” forecasts within the testing period. For example, forecasts made for the 2011/2012 season at Week 4 may use data from any date up to Week 4 of the 2011/2012 season, but no data of any type from later dates. The data may be used to dynamically update coefficients or covariates, but there should be no structural changes to the model and no consideration of data from Week 5 of that season or later.
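The forecast schedule and the data cutoff rule can be enforced in code so that no later observations leak into a forecast. The following Python sketch is illustrative only. It assumes each record has a season label in the form "YYYY/YYYY" (so that string comparison orders seasons chronologically) and a numeric season_week; the helper name data_available is hypothetical.

```python
# Forecasts are issued every 4 weeks: weeks 0, 4, 8, ..., 48.
FORECAST_WEEKS = list(range(0, 49, 4))


def data_available(rows, season, forecast_week):
    """Keep only records dated at or before the forecast point.

    rows: list of dicts with "season" (assumed "YYYY/YYYY" label) and
    "season_week". Records from later weeks of the same season, or from any
    later season, are excluded.
    """
    cutoff = (season, forecast_week)
    return [r for r in rows if (r["season"], r["season_week"]) <= cutoff]


# Toy records for illustration only; these are not project data.
toy_rows = [
    {"season": "2010/2011", "season_week": 52, "total_cases": 3},
    {"season": "2011/2012", "season_week": 4, "total_cases": 9},
    {"season": "2011/2012", "season_week": 5, "total_cases": 12},
    {"season": "2012/2013", "season_week": 1, "total_cases": 2},
]
visible = data_available(toy_rows, "2011/2012", 4)
```

The same filter would need to be applied to every environmental covariate, since the rule covers data of any type.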

San Juan, Puerto Rico - Training templates

Iquitos, Peru - Training templates

Model description

Once model development has been finished, each team should select their best model for future forecasts. Note again that there may be different models for different targets and locations, but only one for each target and location (though that may be an ensemble model). If different models are selected for different targets/locations, the description should include each of those models. The description should include the following components:

1. Team name: This should match the name used in the submission file names.
2. Team members: List every person involved with the forecasting effort and their institution. Include the email address of the team leader.
3. Agreement: Include the following statement: “By submitting these forecasts, I (we) indicate my (our) full and unconditional agreement to abide by the project's official rules and data use agreements.”
4. Model description: Is the model mechanistic or statistical? Is it an instance of a known class of models? The description should include sufficient detail for another modeler to understand the approach being applied. It may include equations, but that is not necessary. If multiple models are used, describe each model and which target each model was used to predict.
5. Variables: What data are used in the model? Historical dengue data? Weather data? Other data? List every variable used and its temporal relationship to the forecast (e.g. lag or order of autocorrelation). If multiple models are used, specify which variables enter into each model.
6. Computational resources: What programming languages/software tools were used to write and execute the forecasts?
7. Publications: Does the model derive directly from previously published work? If so, please include references.

Forecast evaluation

Forecasts will be quantitatively evaluated for each target using two metrics. Point forecasts will be evaluated using relative mean absolute error to assess performance relative to a seasonal autoregressive model and to other forecasts. The probability distributions will be evaluated using the logarithmic score. For each target, relative MAE and the logarithmic score will be calculated across all seasons and forecast times (week of the season) as well as for specific seasons and forecast times to identify potential differences in model strengths. The primary comparisons will be made for the testing period (2010-2013); however, forecasts will also be compared between the training and testing periods to assess how forecast accuracy changes when predicting on data that were excluded from the model development process.

IMPORTANT NOTES: A different model may be employed for each target and location. No metrics will be compared across targets.

Relative Mean Absolute Error

Mean absolute error (MAE) is the mean absolute difference between predictions $\hat{\mathbf{y}}$ and observations $\mathbf{y}$ over $n$ data points:

$$\text{MAE}(\hat{\mathbf{y}}, \mathbf{y})=\frac{1}{n}\sum\limits_{i=1}^{n} \left|\hat{y}_{i}-y_{i}\right|$$

Relative MAE for models A and B is:

$$\text{relMAE}_{A,B}=\frac{\text{MAE}_{A}}{\text{MAE}_{B}}$$

An important feature of this metric is that it can be interpreted directly in terms of accuracy of predictions. For example, $\text{relMAE}_{A,B} = 0.8$ indicates that, on average, predictions from model A were 20% closer to the observed values than those from model B. Additionally, comparing multiple candidate models to a common baseline model with relative MAE allows for the assessment of the relative accuracy of the candidate models. For example, the relative MAE for model A versus the baseline model can be divided by the relative MAE for model B versus the baseline, resulting in the relative MAE for model A versus model B.
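The two formulas above translate directly into code. The following Python sketch computes MAE and relative MAE and checks the baseline property described in the preceding paragraph. The function names and the toy numbers are illustrative assumptions, not part of the project's evaluation code.

```python
def mae(predictions, observations):
    """Mean absolute error between paired predictions and observations."""
    if len(predictions) != len(observations) or not observations:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - o) for p, o in zip(predictions, observations))
    return total / len(observations)


def rel_mae(predictions_a, predictions_b, observations):
    """MAE of model A divided by MAE of model B on the same observations.

    Undefined (raises ZeroDivisionError) if model B has zero error.
    """
    return mae(predictions_a, observations) / mae(predictions_b, observations)


# Toy values for illustration only.
observed = [10, 20, 30, 40]
model_a = [12, 18, 33, 39]    # MAE = 2.0
model_b = [15, 15, 35, 45]    # MAE = 5.0
baseline = [20, 10, 40, 30]   # MAE = 10.0

rel_a_b = rel_mae(model_a, model_b, observed)
rel_a_base = rel_mae(model_a, baseline, observed)
rel_b_base = rel_mae(model_b, baseline, observed)
# Dividing the two baseline-relative values recovers relMAE for A versus B.
rel_a_b_via_baseline = rel_a_base / rel_b_base
```

Here model A's relative MAE against model B is 0.4, so on these toy values its predictions are on average 60% closer to the observations than model B's.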


Logarithmic Score

The logarithmic scoring rule is a proper scoring rule based on a binned probability distribution of the prediction, $\mathbf{p}$. The score is the log of the probability assigned to the observed outcome, $i$:

$$S(\mathbf{p},i) = \text{ln}(p_{i})$$

For example, a single prediction for the peak week includes probabilities for each week (1-52) in the season. The probability assigned to a week when the peak is unlikely would be low, near zero, while the probability assigned to the forecast peak week is the highest. If the observed peak week is week 30 and $p_{30} = 0.15$, the score for this prediction is $\ln(0.15)$, or approximately -1.9.

Note that the total of these probabilities across all weeks (or incidence bins for the other targets) must equal 1, i.e.:

$$\sum\limits_{i=1}^{52} p_{i} = 1$$

Note that the logarithmic score is based on a probabilistic estimate - a complete probability distribution over all possible outcomes. This is desirable because it requires the forecaster to consider the entire range of possible outcomes, and to estimate the likelihood of each one of them. Two forecasters may agree on which outcome is the most likely, and therefore submit the same point estimate, which would be scored identically by MAE. However, their predictions may differ substantially on how likely this outcome is, and how likely other outcomes are. By considering this probability, the logarithmic score enables scoring of the confidence in the prediction, not just the value of the point prediction.

Another advantage of logarithmic scores is that they can be summed across different time periods, targets, and locations to provide both specific and generalized measures of model accuracy. The bins that will be used for forecasts of the peak week, maximum weekly incidence, and total cases in the season will be specified in the forecast templates.
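The logarithmic score described in this section can be sketched in a few lines of Python. The example reproduces the peak week case in which week 30 is observed and was assigned probability 0.15. The function name, the dictionary layout, and the choice to return negative infinity when the observed bin received zero probability are assumptions made for illustration; the project's own handling of zero probabilities is not specified here.

```python
import math


def log_score(probabilities, observed_bin):
    """Natural log of the probability assigned to the observed bin.

    probabilities: dict mapping each bin (e.g. week 1-52) to its probability.
    The probabilities must sum to 1 across all bins.
    """
    total = sum(probabilities.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"probabilities sum to {total}, expected 1")
    p = probabilities.get(observed_bin, 0.0)
    # Assumption: a zero-probability outcome scores negative infinity.
    return math.log(p) if p > 0 else float("-inf")


# Toy forecast: 0.15 on week 30, the remainder spread evenly over other weeks.
peak_week_forecast = {week: 0.85 / 51 for week in range(1, 53)}
peak_week_forecast[30] = 0.15

score = log_score(peak_week_forecast, observed_bin=30)
```

Scores from several forecasts can then be summed to give an overall measure, as noted above.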
