How We Add Value to Public Data With Imputation and Forecasting?
Public data sources are often plagued by missng values. Naively you may think that you can ignore them, but think twice: in most cases, missing data in a table is not missing information, but rather malformatted information. This approach of ignoring or dropping missing values will not be feasible or robust when you want to make a beautiful visualization, or use data in a business forecasting model, a machine learning (AI) applicaton, or a more complex scientific model. All of the above require complete datasets, and naively discarding missing data points amounts to an excessive waste of information. In this example we are continuing the example a not-so-easy to find public dataset.
Completing missing datapoints requires statistical production information (why might the data be missing?) and data science knowhow (how to impute the missing value.) If you do not have a good statistician or data scientist in your team, you will need high-quality, complete datasets. This is what our automated data observatories provide.
Why is data missing?
International organizations offer many statistical products, but usually they are on an ‘as-is’ basis. For example, Eurostat is the world’s premiere statistical agency, but it has no right to overrule whatever data the member states of the European Union, and some other cooperating European countries give to them. And they cannot force these countries to hand over data if they fail to do so. As a result, there will be many data points that are missing, and often data points that have wrong (obsolete) descriptions or geographical dimensions. We will show the geographical aspect of the problem in a separate blogpost; for now, we only focus on missing data.
Some countries have only recently started providing data to the Eurostat umbrella organization, and it is likely that you will find few datapoints for North Macedonia or Bosnia-Herzegovina. Other countries provide data with some delay, and the last one or two years are missing. And there are gaps in some countries’ data, too.
This is a headache if you want to use the data in some machine learning application or in a multiple or panel regression model. You can, of course, discard countries or years where you do not have full data coverage, but this approach usually wastes too much information–if you work with 12 years, and only one data point is available, you would be discarding an entire country’s 11-years’ worth of data. Another option is to estimate the values, or otherwise impute the missing data, when this is possible with reasonable precision. This is where things get tricky, and you will likely need a statistician or a data scientist onboard.
What can we improve?
Consider that the data is only missing from one year for a particular country, 2015. The naive solution would be to omit 2015 or the country at hand from the dataset. This is pretty destructive, because we know a lot about the R&D allocations in this country and in this year! But leaving 2015 blank will not look good on a chart, and will make your machine learning application or your regression model stop.
A statistician or an innovation expert will tell you that you know more-or-less the missing information: the total allocation was most likely not zero in that year. With some statistical or innovation, or public finance specific knowledge you will use the 2014, or 2016 value, or a combination of the two and keep the country and year in the dataset.
Our improved dataset added backcasted (using the best time series model fitting the country’s actually present data), forecasted (again, using the best time series model), and approximated data (using linear approximation.) In a few cases, we add the last or next known value. To give a few quantiative indicators about our work:
- Increased number of observations: 29.2%
- Reduced missing values: -26.4%
- Increased non-missing subset for regression or AI: +64.7%
If your organization is working with panel (longitudional multiple) regressions or various machine learning applications, then your team knows that not havint the +66.67% gain would be a deal-breaker in the choice of models and punctuality of estimates or KPIs or other quantiative products. And that they would spent about 90% of their data resources on achieving this +66.67% gain in usability.
If you happen to work in an NGO, a business unit or a research institute that does not employ data scientists, then it is likely that you can never achieve this improvement, and you have to give up on a number of quantitative tools or visualizations. If you have a data scientist onboard, that professional can use our work as a starting point.
Can you trust our data?
We believe that you can trust our data better than the original public source. We use statistical expertise to find out why data may be missing. Often, it is present in a wrong location (for example, the name of a region changed.)
If you are reluctant to use estimates, think about discarding known actual data from your forecast or visualization, because one data point is missing. How do you provide more accurate information? By hiding known actual data, because one point is missing, or by using all known data and an estimate?
Our codebooks and our API uses the Statistical Data and Metadata eXchange documentation standards to clearly indicate which data is observed, which is missing, which is estimated, and of course, also how it is estimated. This example highlights another important aspect of data trustworthiness. If you have a better idea, you can replace them with a better estimate.
Our indicators come with standardized codebooks that do not only contain the descriptive metadata, but administrative metadata about the history of the indicator values. You will find very important information about the statistical method we used the fill in the data gaps, and even link the reliable, the peer-reviewed scientific, statistical software that made the calculations. For data scientists, we record the plenty of information about the computing environment, too-–this can come handy if your estimates need external authentication, or you suspect a bug.
Avoid the data Sisyphus
If you work in an academic institution, in an NGO or a consultancy, you can never be sure who downloaded the GBARD by socioeconomic objectives (NABS 2007) Eurostat folder from Eurostat. Did they modify the dataset? Did they already make corrections with the missing data? What method did they use? To prevent many potential problems, you will likely download it again, and again, and again…
We have a better solution. You can always rely on our API to import directly the latest, best data, but if you want to be sure, you can use our regular backups on Zenodo. Zenodo is an open science repository managed by CERN and supported by the European Union. On Zenodo, you can find an authoritative copy of our indicator (and its previous versions) with a digital object identifier, in this case, 10.5281/zenodo.5661169. These datasets will be preserved for decades, and nobody can manipulate them. You cannot accidentally overwrite them, and we have no backdoor to modify them.
Are you a data user? Give us some feedback! Shall we do some further automatic data enhancements with our datasets? Document with different metadata? Link more information for business, policy, or academic use? Please give us any feedback!