R-bloggers | Green Deal Data Observatory

statcodelists: use standard, language-independent variable codes to help international data interoperability and machine reuse in R

Wed, 29 Jun 2022 08:12:00 +0100

Visit the documentation website of statcodelists on statcodelists.dataobservatory.eu/.

The goal of statcodelists is to promote the reuse and exchange of statistical information and related metadata by making the internationally standardized SDMX code lists available to R users. SDMX, the Statistical Data and Metadata eXchange, has been published as an ISO International Standard (ISO 17369). The metadata definitions, including the code lists, are updated regularly according to the standard. The authoritative version of the code lists made available in this package is https://sdmx.org/?page_id=3215/.

Purpose

Cross-domain concepts in the SDMX framework describe concepts relevant to many, if not all, statistical domains. SDMX recommends using these concepts whenever feasible in SDMX structures and messages to promote the reuse and exchange of statistical information and related metadata between organisations.

Code lists are predefined sets of terms from which some statistical coded concepts take their values. SDMX cross-domain code lists are used to support cross-domain concepts. What are these cross-domain coded concepts?

Geographical codes, like NL for the Netherlands in the CL_AREA code list.
Standard industry codes, like J631 for Data processing, hosting and related activities (NACE Rev 2 in Europe; beware, it is J592 in Australia and New Zealand, see CL_ACTIVITY_ANZSIC06).
Occupations, like OC2521 for Database designers and administrators in CL_OCCUPATIONS.
Time formatting standards, like CCYY for annual data series in CL_TIME_FORMAT.

Check out the available code lists on the package homepage.

The use of common code lists helps users work more efficiently: it eases the maintenance of, and reduces the need for, mapping systems and interfaces that deliver data and metadata to them. An obvious advantage of using the code systems is that you can retrieve data from national sources regardless of the natural language used in North Macedonia, Japan, the U.S. or the Netherlands. While the data labels may change to be locally human-readable, computers and geeks can read the codes and understand them immediately, provided that they use the standard codes.
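
To make the idea concrete, here is a minimal base R sketch. The two small data frames are made up for illustration and are not taken from the statcodelists package; the column names `code`, `label_en` and `label_nl` are assumptions.

```r
# Hypothetical extract of a code list: one language-independent code
# with human-readable labels in two languages (illustrative only).
area_codes <- data.frame(
  code     = c("NL", "JP", "US"),
  label_en = c("Netherlands", "Japan", "United States"),
  label_nl = c("Nederland", "Japan", "Verenigde Staten"),
  stringsAsFactors = FALSE
)

# Hypothetical observations that carry only the standard code.
observations <- data.frame(
  code  = c("JP", "NL"),
  value = c(10, 20),
  stringsAsFactors = FALSE
)

# Join on the code; the labels of any language can be attached afterwards.
labelled <- merge(observations, area_codes, by = "code", all.x = TRUE)
labelled
```

Because the join key is the code and not a label, the same observations can be presented in English or Dutch without touching the data itself.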

Our data observatories are rolling out SDMX coding across all datasets to help data ingestion and interoperability, data findability and data reuse. statcodelists can help you use standard SDMX codes in your R workflow, both for downloading data from statistical agencies and for producing publication-ready datasets that the rest of the world (and even APIs) will understand.

Installation

You can install statcodelists from CRAN:

install.packages("statcodelists")

Further recommended code values for expressing general statistical concepts like not applicable, etc., can be found in section Generic codes of the Guidelines for the creation and management of SDMX Cross-Domain Code Lists.

For further code lists used by reliable statistical agencies but not harmonized at the SDMX level, please consult the SDMX Global Registry Codelists page.

The creator of this package is not affiliated with SDMX, and this package has not been endorsed by SDMX.

Code of Conduct

Please note that the statcodelists project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Analyze Locally, Act Globally: New regions R Package Release

Wed, 16 Jun 2021 12:00:00 +0000

The new version of our rOpenGov R package regions was released today on CRAN. This package is one of the engines of our experimental open data-as-service Green Deal Data Observatory, Economy Data Observatory, Digital Music Observatory prototypes, which aim to place open data packages into open-source applications.

In international comparisons, the use of nationally aggregated indicators often has many disadvantages: they exhibit very different levels of homogeneity, and the data is often very limited in the number of observations for a cross-sectional analysis. When comparing European countries, a few missing cases can limit the cross-section of countries to around 20 cases, which rules out many analytical methods. Working with sub-national statistics has many advantages: the similarity of the aggregation level and the high number of observations can allow more precise control of model parameters and errors, and the number of observations grows from 20 to 200-300.

The change from national to sub-national level comes with a huge data processing price: internal administrative boundaries, their names and their codes change very frequently.

Yet the change from national to sub-national level comes with a huge data processing price. National boundaries are relatively stable, with only a handful of changes in each recent decade, because changing them requires a more-or-less global consensus. But states are free to change their internal administrative boundaries, and they do so frequently. This means that the names, identification codes and boundary definitions of sub-national regions change very frequently. Joining data from different sources and different years can be very difficult.

Our regions R package helps the data processing, validation and imputation of sub-national, regional datasets and their coding.

Switching from a national to a sub-national level of analysis has numerous advantages, but it comes with a huge price in data processing, validation and imputation, and the regions package aims to help this process.

You can review the problem, and the code that created the two map comparisons, in the Maping Regional Data, Maping Metadata Problems vignette article of the package. A more detailed problem description can be found in Working With Regional, Sub-National Statistical Products.

This package is an offspring of the eurostat package on rOpenGov. It started as a tool to validate and re-code regional Eurostat statistics, but it aims to be a general solution for all sub-national statistics. It will be developed in parallel with other rOpenGov packages.

Get the Package

You can install the development version from GitHub with:

devtools::install_github("rOpenGov/regions")

or the released version from CRAN:

install.packages("regions")

You can review the complete package documentation on regions.dataobservatory.eu. If you find any problems with the code, please raise an issue on GitHub. Pull requests are welcome if you agree with the Contributor Code of Conduct.

If you use regions in your work, please cite the package as: Daniel Antal. (2021, June 16). regions (Version 0.1.7). CRAN. http://doi.org/10.5281/zenodo.4965909

Download the BibLaTeX entry.

Join us

Join our Green Deal Data Observatory collaboration!

Join our open collaboration Green Deal Data Observatory team as a data curator, developer or business developer. More interested in economic policies, particularly computational antitrust, innovation and small enterprises? Check out our Economy Data Observatory team! Or does your interest lie more in data governance, trustworthy AI and other digital market problems? Check out our Digital Music Observatory team!

Economic and Environment Impact Analysis, Automated for Data-as-Service

Thu, 03 Jun 2021 16:00:00 +0000

We have released a new version of iotables as part of the rOpenGov project. The package, as the name suggests, works with European symmetric input-output tables (SIOTs). SIOTs are among the most complex governmental statistical products. They show how each country’s 64 agricultural, industrial, service, and sometimes household sectors relate to each other. They are estimated from various components of the GDP, tax collection, at least every five years.

This code tutorial is not outdated, but the iotables R package has a new release with more environmental impact analysis features.

SIOTs offer great value to policy-makers and analysts who want to make more than educated guesses on how a million euros, pounds or Czech korunas spent on a certain sector will impact other sectors of the economy, employment or GDP. What happens when a bank starts to give new loans and advertise them? How is an increase in economic activity going to affect the amount of wages paid, and where will consumers most likely spend their wages? As national economies begin to reopen after the COVID-19 pandemic lockdowns, SIOTs can be used to calculate the direct and indirect employment effects or value added effects of government grant programs to sectors such as cultural and creative industries, or to actors such as venues for performing arts, movie theaters, bars and restaurants.

Making such calculations requires a bit of matrix algebra and an understanding of input-output economics, direct and indirect effects, and multipliers. Economists, grant designers and policy makers have those skills, but until now such calculations were made either in cumbersome Excel sheets or in proprietary software, because the key to these calculations is to keep vectors and matrices, which have at least one dimension of 64, perfectly aligned. We made this process reproducible with iotables and eurostat on rOpenGov.

Our iotables package creates direct and indirect effects and multipliers programmatically. Our observatory will make those indicators available for all European countries.

Accessing and tidying the data programmatically

The iotables package is in a way an extension to the eurostat R package, which provides programmatic access to the Eurostat data warehouse. The reason for releasing a new package is that working with SIOTs requires plenty of meticulous data wrangling based on various metadata sources, apart from actually accessing the data itself. When working with matrix equations, the bar is higher than with tidy data. Not only must your rows and columns match, but their ordering must strictly conform to the quadrants of the matrix system, including the connecting trade or tax matrices.

When you download a country’s SIOT table, you receive a long form data frame, a very-very long one, which contains the matrix values and their labels like this:

## Table naio_10_cp1700 cached at C:\Users\...\Temp\RtmpGQF4gr/eurostat/naio_10_cp1700_date_code_FF.rds

# we save it for further reference here 
saveRDS(naio_10_cp1700, "not_included/naio_10_cp1700_date_code_FF.rds")

# should you need to retrieve the large tempfiles, they are in 
dir (file.path(tempdir(), "eurostat"))

dplyr::slice_head(naio_10_cp1700, n = 5)

## # A tibble: 5 x 7
##   unit    stk_flow induse  prod_na geo       time        values
##   <chr>   <chr>    <chr>   <chr>   <chr>     <date>       <dbl>
## 1 MIO_EUR DOM      CPA_A01 B1G     EA19      2019-01-01 141873.
## 2 MIO_EUR DOM      CPA_A01 B1G     EU27_2020 2019-01-01 174976.
## 3 MIO_EUR DOM      CPA_A01 B1G     EU28      2019-01-01 187814.
## 4 MIO_EUR DOM      CPA_A01 B2A3G   EA19      2019-01-01      0 
## 5 MIO_EUR DOM      CPA_A01 B2A3G   EU27_2020 2019-01-01      0

The metadata reads like this: the units are in millions of euros, we are analyzing domestic flows, and the national account items B1-B2 for the industry A01. The information of a 64x64 matrix (the SIOT) and its connecting matrices, such as taxes, employment, or CO₂ emissions, must be placed in exactly one correct ordering of columns and rows. Every single data wrangling error will usually lead to an error (the matrix equation has no solution) or, what is worse, to an algebraic error that is very difficult to trace. Our package not only labels this data meaningfully, but also creates very tidy data frames that contain each necessary matrix or vector with a key column.

The iotables package contains the vocabularies (abbreviations and human-readable labels) of three statistical classifications: the so-called COICOP product codes, the NACE industry codes, and the vocabulary of the ESA2010 definition of national accounts (which is the government equivalent of corporate accounting).

Our package currently solves all equations for direct, indirect effects, multipliers and inter-industry linkages. Backward linkages show what happens with the suppliers of an industry, such as catering or advertising in the case of music festivals, if the festivals reopen. The forward linkages show how much extra demand this creates for connecting services that treat festivals as a ‘supplier’, such as cultural tourism.

Example

## Downloading employment data from the Eurostat database.

## Table lfsq_egan22d cached at C:\Users\...\Temp\RtmpGQF4gr/eurostat/lfsq_egan22d_date_code_FF.rds

We match the employment data with the latest structural information from the Symmetric input-output table at basic prices (product by product) Eurostat product. A quick look at the Eurostat website already shows that there is a lot of work ahead to make the data look like an actual symmetric input-output table. Download it with iotable_get(), which does basic labelling and preprocessing on the raw Eurostat files. Because of the size of the unfiltered dataset on Eurostat, the following code may take several minutes to run.

sk_io <-  iotable_get ( labelled_io_data = NULL, 
                        source = "naio_10_cp1700", geo = "SK", 
                        year = 2015, unit = "MIO_EUR", 
                        stk_flow = "TOTAL",
                        labelling = "iotables" )

## Reading cache file C:\Users\..\Temp\RtmpGQF4gr/eurostat/naio_10_cp1700_date_code_FF.rds

## Table  naio_10_cp1700  read from cache file:  C:\Users\..\Temp\RtmpGQF4gr/eurostat/naio_10_cp1700_date_code_FF.rds

## Saving 808 input-output tables into the temporary directory
## C:\Users\...\Temp\RtmpGQF4gr

## Saved the raw data of this table type in temporary directory C:\Users\...\Temp\RtmpGQF4gr/naio_10_cp1700.rds.

The input_coefficient_matrix_create() creates the input coefficient matrix, which is used for most of the analytical functions.

$a_{ij} = X_{ij} / x_j$

It checks the correct ordering of columns, and furthermore it replaces 0 values with 0.000001 to avoid division by zero.
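
The formula can be tried out on a toy example in base R. This is only a sketch of the textbook definition with made-up numbers for two sectors; it is not the package's implementation. Applying the zero replacement to both the flows and the output vector is an assumption.

```r
# Toy inter-industry flows X (rows: supplying sector, columns: using sector).
# All numbers are made up for illustration.
X <- matrix(c(10,  0,
              20, 30),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("agri", "manuf"), c("agri", "manuf")))

# Total output of each using sector, in the same column order as X.
x <- c(agri = 100, manuf = 200)

# Replace exact zeros with a tiny value (assumed here for both X and x).
X[X == 0] <- 0.000001
x[x == 0] <- 0.000001

# a_ij = X_ij / x_j: divide every column j of X by the output of sector j.
A <- sweep(X, MARGIN = 2, STATS = x, FUN = "/")
A
```

The division only makes sense if the columns of `X` and the elements of `x` are in the same order, which is why the ordering check matters so much.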

input_coeff_matrix_sk <- input_coefficient_matrix_create(
  data_table = sk_io
)

## Columns and rows of real_estate_imputed_a, extraterriorial_organizations are all zeros and will be removed.

Then you can create the Leontief inverse, which contains all the structural information about the relationships of the 64x64 sectors of the chosen country, in this case Slovakia, ready for the main equations of input-output economics.

I_sk <- leontieff_inverse_create(input_coeff_matrix_sk)
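
For intuition, the textbook Leontief inverse is the inverse of the identity matrix minus the input coefficient matrix. The base R sketch below uses a made-up 2x2 coefficient matrix; the package function works on labelled data frames and may carry out further checks, so treat this as an illustration of the algebra only.

```r
# Toy input coefficient matrix A (made-up numbers for two sectors).
A <- matrix(c(0.1, 0.2,
              0.3, 0.4),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("agri", "manuf"), c("agri", "manuf")))

# Leontief inverse: L = (I - A)^-1
L <- solve(diag(nrow(A)) - A)

# Output required from each sector to deliver one extra unit of
# final demand for the "agri" sector.
final_demand <- c(agri = 1, manuf = 0)
total_output <- L %*% final_demand
total_output
```

The required output exceeds the one unit of extra final demand because each sector also needs inputs from the other one.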

And take out the primary inputs:

primary_inputs_sk <- coefficient_matrix_create(
  data_table = sk_io, 
  total = 'output', 
  return = 'primary_inputs')

## Columns and rows of real_estate_imputed_a, extraterriorial_organizations are all zeros and will be removed.

Now let’s see what happens if the government tries to stimulate the economy in three sectors, agriculture, car manufacturing, and R&D, with a billion euros. Direct effects measure the initial, direct impact of the change in demand and supply for a product. When production goes up, it will create demand in all supply industries (backward linkages) and create opportunities in the industries that use the product themselves (forward linkages).

direct_effects_create( primary_inputs_sk, I_sk ) %>%
  select ( all_of(c("iotables_row", "agriculture",
                    "motor_vechicles", "research_development"))) %>%
  filter (.data$iotables_row %in% c("gva_effect", "wages_salaries_effect", 
                                    "imports_effect", "output_effect"))

##            iotables_row agriculture motor_vechicles research_development
## 1        imports_effect   1.3684350       2.3028203            0.9764921
## 2 wages_salaries_effect   0.2713804       0.3183523            0.3828014
## 3            gva_effect   0.9669621       0.9790771            0.9669467
## 4         output_effect   2.2876287       3.9840251            2.2579634

Car manufacturing requires many imported components, so each extra unit of demand will create a large importing activity. R&D will create the most local wages (and support the most jobs) because research is job-intensive. As we can see, the effects on imports, wages, gross value added (which will end up in the GDP) and output are very different in these three sectors.

This is not the total effect, because some of the increased production will translate into income, which in turn will be used to create further demand in all parts of the domestic economy. The total effect is characterized by multipliers.

Then solve for the multipliers:

multipliers_sk <- input_multipliers_create( 
  primary_inputs_sk %>%
    filter (.data$iotables_row == "gva"), I_sk )

And select a few industries:

set.seed(12)
multipliers_sk %>% 
  tidyr::pivot_longer ( -all_of("iotables_row"), 
                        names_to = "industry", 
                        values_to = "GVA_multiplier") %>%
  select (-all_of("iotables_row")) %>%
  arrange( -.data$GVA_multiplier) %>%
  dplyr::sample_n(8)

## # A tibble: 8 x 2
##   industry               GVA_multiplier
##   <chr>                           <dbl>
## 1 motor_vechicles                  7.81
## 2 wood_products                    2.27
## 3 mineral_products                 2.83
## 4 human_health                     1.53
## 5 post_courier                     2.23
## 6 sewage                           1.82
## 7 basic_metals                     4.16
## 8 real_estate_services_b           1.48

Vignettes

The Germany 1990 vignette provides an introduction to input-output economics and re-creates the examples of the Eurostat Manual of Supply, Use and Input-Output Tables, by Jörg Beutel (Eurostat Manual).

The United Kingdom Input-Output Analytical Tables vignette by Daniel Antal, based on the work edited by Richard Wild, is a use case on how to correctly import data from outside Eurostat (i.e., not with eurostat::get_eurostat()) and join it properly to a SIOT. We also used this example to create unit tests of our functions from a published, official government statistical release.

Finally, Working With Eurostat Data is a detailed use case of working with all the current functionalities of the package by comparing two economies, Czechia and Slovakia, and it guides you through many more examples than this short blog post.

Our package was originally developed to calculate GVA and employment effects for the Slovak music industry, and similar calculations for the Hungarian film tax shelter. We can now programmatically create reproducible multipliers for all European economies in the Digital Music Observatory, and create further indicators for economic policy making in the Economy Data Observatory.

Environmental Impact Analysis

Our package allows the calculation of various economic policy scenarios, such as changing the VAT on meat or effects of re-opening music festivals on aggregate demand, GDP, tax revenues, or employment. But what about the CO₂, methane and other greenhouse gas effects of the reopening festivals, or the increasing meat prices?

Technically our package can already calculate such effects, but to do so, you have to carefully match further statistical vocabulary items used by the European Environment Agency about air pollutants and greenhouse gases.

The last released version of iotables is Importing and Manipulating Symmetric Input-Output Tables (Version 0.4.4). Zenodo. https://doi.org/10.5281/zenodo.4897472, but we are already working on a new major release. (Download the BibLaTeX entry.) In that release, we are planning to build the necessary vocabulary into the metadata functions to increase the functionality of the package, and to create new indicators for our Green Deal Data Observatory. This experimental data observatory is creating new, high quality statistical indicators from open governmental and open science data sources that have not seen the daylight yet.

rOpenGov and the EU Datathon Challenges

rOpenGov, Reprex, and other open collaboration partners teamed up to build further on our expertise in open source statistical software development: we want to create a technologically and financially feasible data-as-service to put our reproducible research products into wider use by the business analyst, scientific researcher and evidence-based policy design communities.

rOpenGov is a community of open governmental data and statistics developers with many packages that make programmatic access to, and work with, open data possible in the R language. Reprex is a Dutch startup that teamed up with rOpenGov and other open collaboration partners to create a technologically and financially feasible service to exploit reproducible research products for the wider business, scientific and evidence-based policy design community. Open data is a legal concept: it means that you have the right to reuse the data, but often the reuse requires significant programming and statistical know-how. We entered the annual EU Datathon competition in all three challenges with our applications to provide not only open-source software, but also daily updated, validated, documented, high-quality statistical indicators as open data in an open database. Our iotables package is one of our many open-source building blocks to make open data more accessible to all.

Join our Green Deal Data Observatory collaboration!

Regional Geocoding Harmonization Case Study - Regional Climate Change Awareness Datasets

Sat, 06 Mar 2021 00:00:00 +0000

library(regions)
library(lubridate)
library(dplyr)

if ( dir.exists('data-raw') ) {
  data_raw_dir <- "data-raw"
} else {
  data_raw_dir <- file.path("..", "..", "data-raw")
  }

Going beyond the national level

Let’s start with a dirty averaging by sub-national unit. The w1 weighting variable contains the post-stratification weight for the national samples. The Eurobarometer samples represent nations (with the exception of East and West Germany, Northern Ireland and Great Britain). The average of the w1 variable is 1.00 for each sample, but it is not necessarily 1 for smaller territorial units. If the average of w1 is above 1 for, say, AT23, it only means that the AT23 region was undersampled relative to the rest of Austria, and its responses must be over-weighted in post-stratification.

There is no way to make the samples regionally representative, and a correct post-stratification would require further data about the sample design. But we can simply adjust to over/undersampling by making sure that oversampled territorial averages are proportionally increased and undersampled ones are decreased. [Another ‘dirty’ averaging would be the use of an unweighted average, but our method is better, because it more-or-less adjusts gender and education level biases, but leaves intra-country regional biases in the sample.]

panel <- readRDS((file.path(data_raw_dir, "climate-panel.rds")))

climate_data <-  panel %>%
  mutate ( year = lubridate::year(date_of_interview)) %>%
  select ( all_of(c("isocntry", "geo", "w1")), 
           contains("problem")
  )  %>%
  mutate ( 
    # use the post-stratification weights for national samples
    serious_world_problems_first = w1*serious_world_problems_first , 
    serious_world_problems_climate_change = w1*serious_world_problems_climate_change) %>%
  group_by (  .data$geo ) %>%
  summarise( serious_world_problems_first = mean(serious_world_problems_first, na.rm=TRUE),
             serious_world_problems_climate_change = mean (serious_world_problems_climate_change, na.rm=TRUE),
             mean_w1 = mean(w1)
             ) %>%
  mutate ( 
    # adjust for post-stratification weight bias due to regional over/undersampling
    climate_first = serious_world_problems_first / mean_w1, 
    climate_mentioned = serious_world_problems_climate_change / mean_w1
    )

So, we averaged, weighted and adjusted the mentioning of climate change as the world’s most serious, or one of the most serious problems by NUTS regions.
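
The arithmetic of this adjustment can be checked on a toy example in base R. The responses and weights below are made up for a single hypothetical region. With no missing values, averaging the weighted responses and dividing by the regional mean weight gives the ordinary weighted mean.

```r
# Made-up responses (1 = mentions climate change) and post-stratification
# weights for the respondents of one hypothetical region.
mentions <- c(1, 0, 1, 1, 0)
w1       <- c(1.4, 0.8, 1.2, 1.6, 1.0)

# Step 1: average of the weighted responses.
mean_weighted <- mean(w1 * mentions)

# Step 2: divide by the regional mean weight, which is not 1 within a region.
adjusted <- mean_weighted / mean(w1)

# The result equals the standard weighted mean of the responses.
all.equal(adjusted, weighted.mean(mentions, w1))
```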

Aggregation level

The problem is that most statistical data is available for the NUTS regional boundaries according to the NUTS2016 definition. However, GESIS uses NUTS2013 regions, so 252 regional codes in the four survey waves are invalid. Some data is available only on national level, but it can be projected to regional level, because small countries like Luxembourg have no regional divisions. Larger countries like Germany are divided only on state level (NUTS1), while small countries are divided on NUTS3 level.

This leads to various problems. Much data is available only on NUTS2 level, in which case NUTS1 data should be projected to its constituent smaller NUTS2 regions, and NUTS3 level data must be aggregated up to the larger NUTS2 regions that contain them.
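
The aggregation direction can be sketched in base R, relying on the fact that NUTS codes are hierarchical: the first four characters of a NUTS3 code identify its NUTS2 region. The values below are made up, and summing is only appropriate for additive indicators such as counts; this is not how the regions package does it internally.

```r
# Made-up NUTS3-level counts for a few Austrian regions.
nuts3 <- data.frame(
  geo   = c("AT111", "AT112", "AT113", "AT121"),
  value = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# The containing NUTS2 region is given by the first four characters.
nuts3$nuts2 <- substr(nuts3$geo, 1, 4)

# Additive indicators can be summed up to the NUTS2 level.
nuts2 <- aggregate(value ~ nuts2, data = nuts3, FUN = sum)
nuts2
```

This shortcut only holds when all codes follow the same NUTS definition, which is exactly why the choice of boundary version matters.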

Of course, we also must choose if we use NUTS2013 or NUTS2016 boundaries. Sub-national boundaries have changed many thousand times in the EU27 countries alone since 1999.

## # A tibble: 5 x 2
##   validate         n
##   <chr>        <int>
## 1 country         15
## 2 invalid        252
## 3 nuts_level_1   132
## 4 nuts_level_2   452
## 5 nuts_level_3   141

Recoding the Regions

Our regions package was designed to keep track of sub-national regional boundary changes. It can validate regional data codes, and to some extent carry out recoding, imputation or simple aggregation.

Recoding means that the boundaries are unchanged, but the country changed the names/codes of regions, because there were other boundary changes which did not affect our observation unit.
Imputation must not be done with usual, general imputation tools, because our data is regionally structured. However, some imputations are very simple, because we can use equality equations like MT = MT0 = MT00.
Often the boundary change is additive, and merged territorial units can simply be aggregated for comparison with earlier data.
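
The equality-based imputation can be illustrated with a short base R sketch. The indicator values and the small equivalence table are made up for this example and are not part of the regions package; the table is written by hand under the assumption that the listed lower-level codes cover the same territory as the country code.

```r
# A made-up indicator that is only available on national level.
national <- data.frame(
  geo   = c("MT", "LU"),
  value = c(0.42, 0.37),
  stringsAsFactors = FALSE
)

# Hypothetical equivalence table: for these small countries the
# lower-level codes are assumed to cover exactly the same territory.
equivalents <- data.frame(
  geo        = c("MT", "MT", "LU", "LU"),
  geo_target = c("MT0", "MT00", "LU0", "LU00"),
  stringsAsFactors = FALSE
)

# Copy the national value to each equivalent regional code.
imputed <- merge(equivalents, national, by = "geo")
imputed[, c("geo_target", "value")]
```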

regional_coding_2016 <- panel %>%
  mutate ( year = lubridate::year(date_of_interview)) %>%
  select (  all_of(c("isocntry", "geo", "region", "year") ) ) %>%
  distinct_all() %>%
  recode_nuts()

regional_coding_2013 <- panel %>%
  mutate ( year = lubridate::year(date_of_interview)) %>%
  select (  all_of(c("isocntry", "geo", "region", "year") ) ) %>%
  distinct_all() %>%
  recode_nuts( nuts_year = 2013)

climate_data_recoded <- climate_data %>% 
  left_join ( regional_coding_2016, by = 'geo' ) %>%
  left_join ( regional_coding_2013 %>% 
                select ( all_of(c("geo", "code_2013"))), 
              by = "geo") %>%
  distinct_all()

saveRDS ( climate_data_recoded , file.path(tempdir(), "climate_panel_recoded_agr.rds"), version = 2)

# not evaluated
saveRDS( climate_data_recoded , file = file.path("data-raw", "climate_panel_recoded_agr.rds"))

Where Are People More Likely To Treat Climate Change as the Most Serious Global Problem?

Sat, 06 Mar 2021 00:00:00 +0000

library(regions)
library(lubridate)
library(dplyr)

if ( dir.exists('data-raw') ) {
  data_raw_dir <- "data-raw"
} else {
  data_raw_dir <- file.path("..", "..", "data-raw")
  }

The first results of our longitudinal table were difficult to map, because the surveys used an obsolete regional coding. We will adjust the wrong coding, when possible, and join the data with the European Environment Agency’s (EEA) Air Quality e-Reporting (AQ e-Reporting) data on environmental pollution. We recoded the annual level for every available reporting station [not shown here] and all values are in μg/m3. The period under observation is 2014-2016. Data file: https://www.eea.europa.eu/data-and-maps/data/aqereporting-8 (European Environment Agency 2021).

Recoding the Regions

Recoding means that the boundaries are unchanged, but the country changed the names and codes of regions because there were other boundary changes which did not affect our observation unit. We explain the problem and the solution in greater detail in our tutorial that aggregates the data on regional levels.

panel <- readRDS((file.path(data_raw_dir, "climate-panel.rds")))

climate_data_geocode <-  panel %>%
  mutate ( year = lubridate::year(date_of_interview)) %>%
  recode_nuts()

Let’s load the air pollution data and join it by the corrected geocodes:

load(file.path("data", "air_pollutants.rda")) ## good practice to use system-independent file.path

climate_awareness_air <- climate_data_geocode %>%
  # rename the recoded NUTS2016 column to match the key of air_pollutants
  rename ( region_nuts_codes = "code_2016") %>%
  left_join ( air_pollutants, by = "region_nuts_codes" ) %>%
  select ( -all_of(c("w1", "wex", "date_of_interview", 
                     "typology", "typology_change", "geo", "region"))) %>%
  mutate (
    # remove special labels and create NA_numeric_ 
    age_education = retroharmonize::as_numeric(age_education)) %>%
  mutate_if ( is.character, as.factor) %>%
  mutate ( 
    # we only have responses from 4 years, and this should be treated as a categorical variable
    year = as.factor(year) 
    ) %>%
  filter ( complete.cases(.) )

The climate_awareness_air data frame contains the answers of 75086 individual respondents. 17.07% thought that climate change was the most serious world problem and 33.6% mentioned climate change as one of the three most important global problems.

summary ( climate_awareness_air  )

##                  rowid       serious_world_problems_first
##  ZA5877_v2-0-0_1    :    1   Min.   :0.0000              
##  ZA5877_v2-0-0_10   :    1   1st Qu.:0.0000              
##  ZA5877_v2-0-0_100  :    1   Median :0.0000              
##  ZA5877_v2-0-0_1000 :    1   Mean   :0.1707              
##  ZA5877_v2-0-0_10000:    1   3rd Qu.:0.0000              
##  ZA5877_v2-0-0_10001:    1   Max.   :1.0000              
##  (Other)            :75080                               
##  serious_world_problems_climate_change    isocntry    
##  Min.   :0.000                         BE     : 3028  
##  1st Qu.:0.000                         CZ     : 3023  
##  Median :0.000                         NL     : 3019  
##  Mean   :0.336                         SK     : 3000  
##  3rd Qu.:1.000                         SE     : 2980  
##  Max.   :1.000                         DE-W   : 2978  
##                                        (Other):57058  
##                                    marital_status         age_education  
##  (Re-)Married: without children           :13242   18            :15485  
##  (Re-)Married: children this marriage     :12696   19            : 7728  
##  Single: without children                 : 7650   16            : 5840  
##  (Re-)Married: w children of this marriage: 6520   still studying: 5098  
##  (Re-)Married: living without children    : 6225   17            : 5092  
##  Single: living without children          : 4102   15            : 4528  
##  (Other)                                  :24651   (Other)       :31315  
##    age_exact                      occupation_of_respondent
##  Min.   :15.0   Retired, unable to work       :22911      
##  1st Qu.:36.0   Skilled manual worker         : 6774      
##  Median :51.0   Employed position, at desk    : 6716      
##  Mean   :50.1   Employed position, service job: 5624      
##  3rd Qu.:65.0   Middle management, etc.       : 5252      
##  Max.   :99.0   Student                       : 5098      
##                 (Other)                       :22711      
##             occupation_of_respondent_recoded
##  Employed (10-18 in d15a)   :32763          
##  Not working (1-4 in d15a)  :37125          
##  Self-employed (5-9 in d15a): 5198          
##                                             
##                                             
##                                             
##                                             
##                        respondent_occupation_scale_c_14
##  Retired (4 in d15a)                   :22911          
##  Manual workers (15 to 18 in d15a)     :15269          
##  Other white collars (13 or 14 in d15a): 9203          
##  Managers (10 to 12 in d15a)           : 8291          
##  Self-employed (5 to 9 in d15a)        : 5198          
##  Students (2 in d15a)                  : 5098          
##  (Other)                               : 9116          
##                   type_of_community   is_student      no_education     
##  DK                        :   34   Min.   :0.0000   Min.   :0.000000  
##  Large town                :20939   1st Qu.:0.0000   1st Qu.:0.000000  
##  Rural area or village     :24686   Median :0.0000   Median :0.000000  
##  Small or middle sized town: 9850   Mean   :0.0679   Mean   :0.008151  
##  Small/middle town         :19577   3rd Qu.:0.0000   3rd Qu.:0.000000  
##                                     Max.   :1.0000   Max.   :1.000000  
##                                                                        
##    education       year       region_nuts_codes  country_code  
##  Min.   :14.00   2013:25103   LU     : 1432     DE     : 4531  
##  1st Qu.:17.00   2015:    0   MT     : 1398     GB     : 3538  
##  Median :18.00   2017:25053   CY     : 1192     BE     : 3028  
##  Mean   :19.61   2019:24930   SK02   : 1053     CZ     : 3023  
##  3rd Qu.:22.00                EL30   :  974     NL     : 3019  
##  Max.   :30.00                EE     :  973     SK     : 3000  
##                               (Other):68064     (Other):54947  
##      pm2_5             pm10               o3              BaP        
##  Min.   : 2.109   Min.   :  5.883   Min.   : 66.37   Min.   :0.0102  
##  1st Qu.: 9.374   1st Qu.: 28.326   1st Qu.: 90.89   1st Qu.:0.1779  
##  Median :11.866   Median : 33.673   Median :102.81   Median :0.4105  
##  Mean   :12.954   Mean   : 38.637   Mean   :101.49   Mean   :0.8759  
##  3rd Qu.:15.890   3rd Qu.: 49.488   3rd Qu.:110.73   3rd Qu.:1.0692  
##  Max.   :41.293   Max.   :123.239   Max.   :141.04   Max.   :7.8050  
##                                                                      
##       so2              ap_pc1            ap_pc2             ap_pc3       
##  Min.   : 0.0000   Min.   :-4.6669   Min.   :-2.21851   Min.   :-2.1007  
##  1st Qu.: 0.0000   1st Qu.:-0.4624   1st Qu.:-0.49130   1st Qu.:-0.5695  
##  Median : 0.0000   Median : 0.4263   Median : 0.02902   Median :-0.1113  
##  Mean   : 0.1032   Mean   : 0.1031   Mean   : 0.04166   Mean   :-0.1746  
##  3rd Qu.: 0.0000   3rd Qu.: 0.9748   3rd Qu.: 0.57416   3rd Qu.: 0.3309  
##  Max.   :42.5325   Max.   : 2.0344   Max.   : 3.25841   Max.   : 4.1615  
##                                                                          
##      ap_pc4            ap_pc5        
##  Min.   :-1.7387   Min.   :-2.75079  
##  1st Qu.:-0.1669   1st Qu.:-0.18748  
##  Median : 0.0371   Median : 0.01811  
##  Mean   : 0.1154   Mean   : 0.06797  
##  3rd Qu.: 0.3050   3rd Qu.: 0.34937  
##  Max.   : 3.2476   Max.   : 1.42816  
##

Let’s see a simple CART tree! We remove the regional codes, because climate awareness differs very strongly across regions. These regional differences, together with education level and the survey year, are the most important predictors of naming climate change as the most serious global problem in Europe.

# Classification Tree with rpart
library(rpart)
library(dplyr) # for %>%, select() and all_of()

# grow tree
fit <- rpart(as.factor(serious_world_problems_first) ~ .,
   method="class", data=climate_awareness_air %>%
     select ( - all_of(c("rowid", "region_nuts_codes"))), 
   control = rpart.control(cp = 0.005))

printcp(fit) # display the results

## 
## Classification tree:
## rpart(formula = as.factor(serious_world_problems_first) ~ ., 
##     data = climate_awareness_air %>% select(-all_of(c("rowid", 
##         "region_nuts_codes"))), method = "class", control = rpart.control(cp = 0.005))
## 
## Variables actually used in tree construction:
## [1] age_education                         isocntry                             
## [3] serious_world_problems_climate_change year                                 
## 
## Root node error: 12817/75086 = 0.1707
## 
## n= 75086 
## 
##          CP nsplit rel error  xerror      xstd
## 1 0.0240566      0   1.00000 1.00000 0.0080438
## 2 0.0082703      3   0.92783 0.92783 0.0078055
## 3 0.0050000      5   0.91129 0.91425 0.0077588
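The rel error and xerror columns are scaled so that the root node has an error of 1. A minimal sketch of converting them back to absolute misclassification rates, using only the numbers printed above (the object names are ours, chosen for illustration):

```r
# Numbers copied from the printcp() output above
root_error <- 12817 / 75086   # share of the minority class at the root node

cp_table <- data.frame(
  nsplit    = c(0, 3, 5),
  rel_error = c(1.00000, 0.92783, 0.91129),
  xerror    = c(1.00000, 0.92783, 0.91425)
)

# Absolute error = relative error * root node error
cp_table$abs_error  <- cp_table$rel_error * root_error
cp_table$abs_xerror <- cp_table$xerror * root_error

print(round(cp_table, 4))
```

Read this way, the tree with five splits has a cross-validated misclassification rate of about 15.6%, compared with about 17.1% for always predicting the majority class.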

plotcp(fit) # visualize cross-validation results

summary(fit) # detailed summary of splits

## Call:
## rpart(formula = as.factor(serious_world_problems_first) ~ ., 
##     data = climate_awareness_air %>% select(-all_of(c("rowid", 
##         "region_nuts_codes"))), method = "class", control = rpart.control(cp = 0.005))
##   n= 75086 
## 
##            CP nsplit rel error    xerror        xstd
## 1 0.024056592      0 1.0000000 1.0000000 0.008043837
## 2 0.008270266      3 0.9278302 0.9278302 0.007805478
## 3 0.005000000      5 0.9112897 0.9142545 0.007758824
## 
## Variable importance
## serious_world_problems_climate_change                              isocntry 
##                                    31                                    26 
##                          country_code                                   BaP 
##                                    20                                     8 
##                                 pm2_5                                ap_pc1 
##                                     4                                     3 
##                         age_education                                  pm10 
##                                     2                                     2 
##                             education                                ap_pc2 
##                                     2                                     1 
##                                  year 
##                                     1 
## 
## Node number 1: 75086 observations,    complexity param=0.02405659
##   predicted class=0  expected loss=0.1706976  P(node) =1
##     class counts: 62269 12817
##    probabilities: 0.829 0.171 
##   left son=2 (25229 obs) right son=3 (49857 obs)
##   Primary splits:
##       serious_world_problems_climate_change < 0.5          to the right, improve=2214.2040, (0 missing)
##       isocntry                              splits as  RRLLLRRRLLRLRLLLLLLLLLLRRLLLRLL, improve= 728.0160, (0 missing)
##       country_code                          splits as  RRLLLRRLLRLLLLLLLLLLRRLLLRLL, improve= 673.3656, (0 missing)
##       BaP                                   < 0.4300347    to the right, improve= 310.6229, (0 missing)
##       pm2_5                                 < 13.38264     to the right, improve= 296.4013, (0 missing)
##   Surrogate splits:
##       age_education splits as  ----RRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRL-RRR-RRRRRRRRR--RRRLLR--R-R, agree=0.664, adj=0, (0 split)
##       pm10          < 7.491315     to the left,  agree=0.664, adj=0, (0 split)
## 
## Node number 2: 25229 observations
##   predicted class=0  expected loss=0  P(node) =0.3360014
##     class counts: 25229     0
##    probabilities: 1.000 0.000 
## 
## Node number 3: 49857 observations,    complexity param=0.02405659
##   predicted class=0  expected loss=0.2570752  P(node) =0.6639986
##     class counts: 37040 12817
##    probabilities: 0.743 0.257 
##   left son=6 (34631 obs) right son=7 (15226 obs)
##   Primary splits:
##       isocntry     splits as  RRLLLRRRLLRLRLLLLLLLLLLRRLLLRLL, improve=1454.9460, (0 missing)
##       country_code splits as  RRLLLRRLLRLLLLLLLLLLRRLLLRLL, improve=1359.7210, (0 missing)
##       BaP          < 0.4300347    to the right, improve= 629.8844, (0 missing)
##       pm2_5        < 13.38264     to the right, improve= 555.7484, (0 missing)
##       ap_pc1       < -0.005459537 to the left,  improve= 533.3579, (0 missing)
##   Surrogate splits:
##       country_code splits as  RRLLLRRLLRLLLLLLLLLLRRLLLRLL, agree=0.987, adj=0.957, (0 split)
##       BaP          < 0.1749425    to the right, agree=0.775, adj=0.264, (0 split)
##       pm2_5        < 5.206993     to the right, agree=0.737, adj=0.140, (0 split)
##       ap_pc1       < 1.405527     to the left,  agree=0.733, adj=0.126, (0 split)
##       pm10         < 25.31211     to the right, agree=0.718, adj=0.076, (0 split)
## 
## Node number 6: 34631 observations
##   predicted class=0  expected loss=0.1769802  P(node) =0.4612178
##     class counts: 28502  6129
##    probabilities: 0.823 0.177 
## 
## Node number 7: 15226 observations,    complexity param=0.02405659
##   predicted class=0  expected loss=0.4392487  P(node) =0.2027808
##     class counts:  8538  6688
##    probabilities: 0.561 0.439 
##   left son=14 (11607 obs) right son=15 (3619 obs)
##   Primary splits:
##       isocntry      splits as  LL---LLR--L-L----------LL---R--, improve=337.5462, (0 missing)
##       country_code  splits as  LL---LR--L-L--------LL---R--, improve=337.5462, (0 missing)
##       age_education splits as  ----LLLLLL-LLLRRRRRRR-RRRRRRRRRL-RRRRRRLLRR-RRRRLLRLRL-RRLRRR-RRR-LLLLRRR-----LR-----L-R, improve=294.0807, (0 missing)
##       education     < 22.5         to the left,  improve=262.3747, (0 missing)
##       BaP           < 0.053328     to the right, improve=232.7043, (0 missing)
##   Surrogate splits:
##       BaP           < 0.053328     to the right, agree=0.878, adj=0.485, (0 split)
##       pm2_5         < 4.810361     to the right, agree=0.827, adj=0.271, (0 split)
##       ap_pc2        < 0.8746175    to the left,  agree=0.792, adj=0.124, (0 split)
##       so2           < 0.3302972    to the left,  agree=0.781, adj=0.078, (0 split)
##       age_education splits as  ----LLLLLL-LLLLLLLRLR-LRRLRRRRRR-RRRRLLLLLR-LRLRLLRRLL-LLRLLR-LLR-RRLLLLL-----RR-----R-L, agree=0.779, adj=0.071, (0 split)
## 
## Node number 14: 11607 observations,    complexity param=0.008270266
##   predicted class=0  expected loss=0.3804601  P(node) =0.1545827
##     class counts:  7191  4416
##    probabilities: 0.620 0.380 
##   left son=28 (7462 obs) right son=29 (4145 obs)
##   Primary splits:
##       age_education                    splits as  ----LLLLLL-LRRRRRRRRR-RRLRRLRRLL-RRRRLRLLRR-RLRLLLRLRL-RR-RR--RRL-L-LLRRR------------L-R, improve=123.71070, (0 missing)
##       year                             splits as  R-LR, improve=107.79460, (0 missing)
##       education                        < 20.5         to the left,  improve= 90.28724, (0 missing)
##       occupation_of_respondent         splits as  LRRLRRRRRLRLLLRLLL, improve= 84.62865, (0 missing)
##       respondent_occupation_scale_c_14 splits as  LRLLLRRL, improve= 68.88653, (0 missing)
##   Surrogate splits:
##       education                        < 20.5         to the left,  agree=0.950, adj=0.861, (0 split)
##       occupation_of_respondent         splits as  LLLLRLLRRLRLLLRLLL, agree=0.738, adj=0.267, (0 split)
##       respondent_occupation_scale_c_14 splits as  LRLLLLRL, agree=0.733, adj=0.251, (0 split)
##       is_student                       < 0.5          to the left,  agree=0.709, adj=0.186, (0 split)
##       age_exact                        < 23.5         to the right, agree=0.676, adj=0.094, (0 split)
## 
## Node number 15: 3619 observations
##   predicted class=1  expected loss=0.3722023  P(node) =0.04819807
##     class counts:  1347  2272
##    probabilities: 0.372 0.628 
## 
## Node number 28: 7462 observations
##   predicted class=0  expected loss=0.326052  P(node) =0.09937938
##     class counts:  5029  2433
##    probabilities: 0.674 0.326 
## 
## Node number 29: 4145 observations,    complexity param=0.008270266
##   predicted class=0  expected loss=0.4784077  P(node) =0.05520337
##     class counts:  2162  1983
##    probabilities: 0.522 0.478 
##   left son=58 (2573 obs) right son=59 (1572 obs)
##   Primary splits:
##       year                     splits as  L-LR, improve=40.13885, (0 missing)
##       occupation_of_respondent splits as  LRLLRRRRRLRLLLRLLL, improve=18.33254, (0 missing)
##       marital_status           splits as  LRRRLRRRLRRLRLLRRRRRRLRLRLLRR, improve=17.86888, (0 missing)
##       type_of_community        splits as  LRLRL, improve=17.55254, (0 missing)
##       age_education            splits as  ------------LLRRRRRRR-RR-RL-RR---LRRR-R--LR-R-R---R-R--RR-RR--RR------RRR--------------R, improve=14.66121, (0 missing)
##   Surrogate splits:
##       type_of_community splits as  LLLRL, agree=0.777, adj=0.412, (0 split)
##       marital_status    splits as  RRLLLLLRLLLLLLLRRRLLLLLLRLRLL, agree=0.680, adj=0.155, (0 split)
##       isocntry          splits as  LL---LL---L-R----------LL------, agree=0.669, adj=0.127, (0 split)
##       country_code      splits as  LL---L---L-R--------LL------, agree=0.669, adj=0.127, (0 split)
##       o3                < 83.06345     to the right, agree=0.650, adj=0.076, (0 split)
## 
## Node number 58: 2573 observations
##   predicted class=0  expected loss=0.4240187  P(node) =0.03426737
##     class counts:  1482  1091
##    probabilities: 0.576 0.424 
## 
## Node number 59: 1572 observations
##   predicted class=1  expected loss=0.43257  P(node) =0.02093599
##     class counts:   680   892
##    probabilities: 0.433 0.567

# plot tree
plot(fit, uniform=TRUE,
   main="Classification Tree: Climate Change Is The Most Serious Threat")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

## Warning in labels.rpart(x, minlength = minlength): more than 52 levels in a
## predicting factor, truncated for printout

saveRDS ( climate_awareness_air , file.path(tempdir(), "climate_panel_recoded.rds"), version = 2)

# not evaluated
saveRDS( climate_awareness_air, file = file.path("data-raw", "climate-panel_recoded.rds"))

Retrospective Survey Harmonization Case Study - Climate Awareness Change in Europe 2013-2019.

Fri, 05 Mar 2021 00:00:00 +0000

Retrospective survey harmonization comes with many challenges, as we have shown in the introduction to this tutorial case study. In this example, we will work with Eurobarometer’s data.

This code tutorial is not outdated, but the retroharmonize R package has a new (development) release with more features.

Click to expand table of contents of the post

Table of Contents

Please use the development version of retroharmonize:

devtools::install_github("antaldaniel/retroharmonize")

library(retroharmonize)
library(dplyr)       # this is necessary for the example 
library(lubridate)   # easier date conversion

## Warning: package 'lubridate' was built under R version 4.0.4

library(stringr)     # You can also use base R string processing functions

Get the Data

retroharmonize is not associated with Eurobarometer, or its creators, Kantar, or its archivists, GESIS. We assume that you have acquired the necessary files from GESIS after carefully reading their terms and that you placed them on a path that you call gesis_dir. The precise documentation of the data we use can be found in this supporting blogpost. To reproduce this blogpost, you will need ZA5877_v2-0-0.sav, ZA6595_v3-0-0.sav, ZA6861_v1-2-0.sav, ZA7488_v1-0-0.sav, ZA7572_v1-0-0.sav in a directory that you will name gesis_dir.

#Not run in the blogpost. In the repo we have a saved version.
climate_change_files <- c("ZA5877_v2-0-0.sav", "ZA6595_v3-0-0.sav",  "ZA6861_v1-2-0.sav", 
                          "ZA7488_v1-0-0.sav", "ZA7572_v1-0-0.sav")

eb_waves <- read_surveys(file.path(gesis_dir, climate_change_files), .f='read_spss')

if (dir.exists("data-raw")) {
  save ( eb_waves,  file = file.path("data-raw", "eb_climate_change_waves.rda") )
}

if ( file.exists( file.path("data-raw", "eb_climate_change_waves.rda") )) {
  load (file.path( "data-raw", "eb_climate_change_waves.rda" ) )
} else {
  load (file.path("..", "..",  "data-raw", "eb_climate_change_waves.rda") )
}

The eb_waves nested list contains five surveys imported from SPSS to the survey class of retroharmonize. The survey class is a data.frame that retains important metadata for further harmonization.

document_waves (eb_waves)

## # A tibble: 5 x 5
##   id            filename           ncol  nrow object_size
##   <chr>         <chr>             <int> <int>       <dbl>
## 1 ZA5877_v2-0-0 ZA5877_v2-0-0.sav   604 27919   139352456
## 2 ZA6595_v3-0-0 ZA6595_v3-0-0.sav   519 27718   119370440
## 3 ZA6861_v1-2-0 ZA6861_v1-2-0.sav   657 27901   151397528
## 4 ZA7488_v1-0-0 ZA7488_v1-0-0.sav   752 27339   169465928
## 5 ZA7572_v1-0-0 ZA7572_v1-0-0.sav   348 27655    80562432

Beware the object sizes. If you work with many surveys, memory-efficient programming becomes imperative. We will be subsetting whenever possible.
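To get a feel for how much subsetting saves, here is a minimal sketch with a made-up wide table (the table, its dimensions, and the selected column names are illustrative assumptions, not Eurobarometer data):

```r
set.seed(2021)

# A toy "survey" with 2000 respondents and 300 numeric variables
wide_survey <- as.data.frame(matrix(rnorm(2000 * 300), nrow = 2000))

# Keep only the variables needed for the current harmonization block
needed_vars  <- c("V1", "V2", "V3")
small_survey <- wide_survey[, needed_vars, drop = FALSE]

sizes <- c(
  full   = as.numeric(object.size(wide_survey)),
  subset = as.numeric(object.size(small_survey))
)
print(sizes)  # object sizes in bytes
```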

Metadata analysis

As noted before, prepare to work with nested lists. Each imported survey is nested as a data frame in the eb_waves list.

Metadata: Protocol Variables

Eurobarometer refers to certain metadata elements, like the interviewee’s cooperation level or the date of a survey interview, as protocol variables. Let’s start here. This will be our template to harmonize more and more aspects of the five surveys (each of which is, in fact, already a harmonization of about 30 surveys conducted in a single ‘wave’ in multiple countries).

# select variables of interest from the metadata
eb_protocol_metadata <- eb_climate_metadata %>%
  filter ( .data$label_orig %in% c("date of interview") |
             .data$var_name_orig == "rowid")  %>%
  suggest_var_names( survey_program = "eurobarometer" )

# subset and harmonize these variables in all nested list items of 'waves' of surveys
interview_dates <- harmonize_var_names(eb_waves, 
                                       eb_protocol_metadata )

# apply similar data processing rules to same variables
interview_dates <- lapply (interview_dates, 
                      function (x) x %>% mutate ( date_of_interview = as_character(.data$date_of_interview) )
                      )

# join the individual survey tables into a single table 
interview_dates <- as_tibble ( Reduce (rbind, interview_dates) )

# Check the variable classes.

vapply(interview_dates, function(x) class(x)[1], character(1))

##             rowid date_of_interview 
##       "character"       "character"

This is our sample workflow for each block of variables.

Get a unique identifier.
Add other variables
Harmonize the variable names
Subset the data leaving out anything that you do not harmonize in this block.
Apply some normalization in a nested list.
When the variables are harmonized to the same name and class, merge them into a data.frame-like tibble object.
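The steps above can be sketched in base R on two made-up surveys. The data, the crosswalk list, and the harmonize_block() helper are hypothetical stand-ins for illustration; they are not part of retroharmonize.

```r
# Two toy surveys with different original variable names (made-up data)
waves <- list(
  wave_a = data.frame(
    uniqid = 1:2,
    p1     = c("Monday, 1st April 2019", "Tuesday, 2nd April 2019"),
    q1     = c(1, 2),   # not harmonized in this block
    stringsAsFactors = FALSE
  ),
  wave_b = data.frame(
    unique_id = 1:3,
    date      = c("Wednesday, 31st October 2018",
                  "Thursday, 1st November 2018",
                  "Friday, 2nd November 2018"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical crosswalk: original name -> harmonized name, per survey
crosswalk <- list(
  wave_a = c(uniqid = "rowid", p1 = "date_of_interview"),
  wave_b = c(unique_id = "rowid", date = "date_of_interview")
)

harmonize_block <- function(survey, id, names_map) {
  survey <- survey[, names(names_map), drop = FALSE]  # subset to this block
  names(survey) <- unname(names_map)                  # harmonize the names
  survey$rowid <- paste0(id, "_", survey$rowid)       # unique identifier
  survey$date_of_interview <- as.character(survey$date_of_interview)  # same class
  survey
}

# Process each list element, then bind the identically shaped tables
harmonized <- Map(harmonize_block, waves, names(waves), crosswalk)
interviews <- do.call(rbind, harmonized)
rownames(interviews) <- NULL
print(interviews)
```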

Now finish the harmonization. Wednesday, 31st October 2018 should become a Date type 2018-10-31.

require(lubridate)
harmonize_date <- function(x) {
  x <- tolower(as.character(x))
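  # drop weekday names and commas
  x <- gsub("monday|tuesday|wednesday|thursday|friday|saturday|sunday|\\,", "", x)
  # drop ordinal suffixes only after a digit, so that "august" stays intact
  x <- gsub("(?<=[0-9])(st|nd|rd|th)", "", x, perl = TRUE)
  x <- gsub("decemberber", "december", x) # all those annoying real-life data problems!
  x <- stringr::str_trim (x, "both")
  x <- gsub("^0", "", x )
  x <- gsub("\\s+", " ", x) # collapse repeated whitespace into a single space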
  lubridate::dmy(x) 
}

interview_dates <- interview_dates %>%
  mutate ( date_of_interview = harmonize_date(.data$date_of_interview) )

vapply(interview_dates, function(x) class(x)[1], character(1))

##             rowid date_of_interview 
##       "character"            "Date"
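For readers who want to check the parsing logic without lubridate, here is a dependency-free sketch of the same idea. The parse_eb_date() name is hypothetical, and the function assumes English weekday and month names in the order "weekday, day month year".

```r
parse_eb_date <- function(x) {
  x <- tolower(as.character(x))
  x <- gsub("^[a-z]+,\\s*", "", x)                          # drop leading weekday and comma
  x <- gsub("(?<=[0-9])(st|nd|rd|th)", "", x, perl = TRUE)  # 31st -> 31
  parts <- strsplit(trimws(x), "\\s+")
  day   <- vapply(parts, function(p) as.integer(p[1]), integer(1))
  month <- vapply(parts, function(p) match(p[2], tolower(month.name)), integer(1))
  year  <- vapply(parts, function(p) as.integer(p[3]), integer(1))
  # ISOdate() returns NA for unparseable parts instead of raising an error
  as.Date(ISOdate(year, month, day))
}

parsed <- parse_eb_date(c("Wednesday, 31st October 2018",
                          "Thursday, 2nd August 2018"))
print(parsed)
```

Matching month names against the built-in month.name constant keeps the sketch independent of the locale settings of the machine.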

Row IDs may not be unique across different surveys. To avoid duplicates, we created a simple, sequential ID for each survey that includes the ID of the original file.

set.seed(2021)
sample_n(interview_dates, 6)

## # A tibble: 6 x 2
##   rowid               date_of_interview
##   <chr>               <date>           
## 1 ZA7488_v1-0-0_7016  2018-10-28       
## 2 ZA7488_v1-0-0_19187 2018-11-02       
## 3 ZA6861_v1-2-0_1218  2017-03-18       
## 4 ZA6861_v1-2-0_4142  2017-03-21       
## 5 ZA7572_v1-0-0_12363 2019-04-17       
## 6 ZA7572_v1-0-0_8071  2019-04-18

After this type-conversion problem, let’s see an issue where an original SPSS variable can have two meaningful R representations.

Metadata: Geographical information

Let’s continue with harmonizing geographical information in the files. In this example, var_name_suggested will contain the harmonized variable name. It is likely that you have to make this call, after carefully reading the original questionnaires and codebooks.

eb_regional_metadata <- eb_climate_metadata %>%
  filter ( grepl( "rowid|isocntry|^nuts$", .data$var_name_orig)) %>%
  suggest_var_names( survey_program = "eurobarometer" ) %>%
  mutate ( var_name_suggested = case_when ( 
    var_name_suggested == "region_nuts_codes"     ~ "geo",
    TRUE ~ var_name_suggested ))

The harmonize_var_names() takes all variables in the subsetted, geographical metadata table, and brings them to the harmonized var_name_suggested name. The function subsets the surveys to avoid the presence of non-harmonized variables. All regional NUTS codes become geo in our case:

geography <- harmonize_var_names(eb_waves, 
                                 eb_regional_metadata)

If you are used to working with single survey files, you are likely to work in a tabular format, which easily converts into a data.frame-like object, in our example, into tidyverse’s tibble. However, when working with longitudinal data, it is far simpler to work with nested lists, because the tables usually have different dimensions (neither the rows corresponding to observations nor the columns are the same across all survey files).

In the nested list, each list element is a single, tabular-format survey. (In fact, the surveys are in retroharmonize’s survey class, which is a rich tibble that contains the metadata and the processing history of the survey.)

The regional information in the Eurobarometer files is contained in the nuts variable. We want to keep both the original labels and values. The original values are the region’s codes, and the labels are the names. The easiest and fastest solution is the base R lapply loop.

geography <- lapply ( geography, 
                      function (x) x %>% mutate ( region = as_character(geo), 
                                                  geo    = as.character(geo) )  
)
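The two representations can be imitated in base R with a plain code vector and a lookup of value labels. This is only a toy stand-in for a labelled SPSS variable; the object names are made up and no retroharmonize classes are involved.

```r
# Region codes as stored in the data, and a code -> label lookup
nuts        <- c("DK02", "PT17", "SI012", "DK02")
nuts_labels <- c(DK02 = "Sjaelland", PT17 = "Lisboa", SI012 = "Podravska")

geography_toy <- data.frame(
  geo    = nuts,                       # the values: region codes
  region = unname(nuts_labels[nuts]),  # the labels: region names
  stringsAsFactors = FALSE
)
print(geography_toy)
```

Keeping both columns means the codes stay available for joins with other statistical data, while the names remain readable for humans.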

Because each table has exactly the same columns, we can simply use rbind() and reduce the list to a modern data.frame, i.e. a tibble.

geography <- as_tibble ( Reduce (rbind, geography) )

Let’s see a dozen cases:

set.seed(2021)
sample_n(geography, 12)

## # A tibble: 12 x 4
##    rowid               isocntry geo   region              
##    <chr>               <chr>    <chr> <chr>               
##  1 ZA7488_v1-0-0_7016  SI       SI012 Podravska           
##  2 ZA7488_v1-0-0_19187 PL       PL63  Pomorskie           
##  3 ZA6861_v1-2-0_1218  DK       DK02  Sjaelland           
##  4 ZA6861_v1-2-0_4142  FI       FI1B  Helsinki-Uusimaa    
##  5 ZA7572_v1-0-0_12363 SE       SE12  Oestra Mellansverige
##  6 ZA7572_v1-0-0_8071  IT       ITH   Nord-Est [IT]       
##  7 ZA6861_v1-2-0_6145  IE       IE021 Dublin              
##  8 ZA6861_v1-2-0_24638 RO       RO31  South [RO]          
##  9 ZA7488_v1-0-0_11315 CY       CY    REPUBLIC OF CYPRUS  
## 10 ZA6595_v3-0-0_27568 HR       HR041 Grad Zagreb         
## 11 ZA7572_v1-0-0_17397 CZ       CZ06  Jihovychod          
## 12 ZA6861_v1-2-0_10993 PT       PT17  Lisboa

The idea is that we do similar variable harmonization block by block, and eventually we will join them together. Next step: socio-demography and weights.

Socio-demography and Weights

There are a few peculiar issues to look out for. This example shows that survey harmonization requires plenty of expert judgment, and you cannot fully automate the process.

The Eurobarometer archives do not use all weight and demographic variable names consistently. For example, the wex variable, which is a projected weight for the country’s 15 years old or older population is sometimes called wex, sometimes wextra. The individual survey’s post-stratification weight is the w1 variable, but this is not necessarily what you need to use.

The suggest_var_names() function has a survey_program = "eurobarometer" parameter setting, which normalizes the names of the most frequently used variables. For example, all variations of wex and wextra will be normalized to wex. You can ignore this parameter and use your own names, too.

eb_demography_metadata  <- eb_climate_metadata %>%
  filter ( grepl( "rowid|isocntry|^d8$|^d7$|^wex|^w1$|d25|^d15a|^d11$", .data$var_name_orig) ) %>%
  suggest_var_names( survey_program = "eurobarometer")

As you can see, using the original labels would not help, because they also contain various alterations.

eb_demography_metadata %>%
  select ( filename, var_name_orig, label_orig, var_name_suggested ) %>%
  filter (var_name_orig %in% c("wex", "wextra") )

##            filename var_name_orig                                  label_orig
## 1 ZA5877_v2-0-0.sav        wextra      weight extrapolated population 15 plus
## 2 ZA6595_v3-0-0.sav        wextra      weight extrapolated population 15 plus
## 3 ZA6861_v1-2-0.sav           wex weight extrapolated population aged 15 plus
## 4 ZA7488_v1-0-0.sav           wex weight extrapolated population aged 15 plus
## 5 ZA7572_v1-0-0.sav           wex weight extrapolated population aged 15 plus
##   var_name_suggested
## 1                wex
## 2                wex
## 3                wex
## 4                wex
## 5                wex

demography <- harmonize_var_names ( waves = eb_waves, 
                                    metadata = eb_demography_metadata )
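The renaming idea behind this step can be sketched in base R with a small crosswalk. The toy waves, the weights, and the rename_by_crosswalk() helper are hypothetical; the sketch only illustrates how wextra and wex end up under one name.

```r
# Two toy waves that store the same weight under different names (made-up values)
wave_2013 <- data.frame(isocntry = c("NL", "BE"), wextra = c(1200.5, 980.1),
                        stringsAsFactors = FALSE)
wave_2019 <- data.frame(isocntry = c("NL", "BE"), wex = c(1310.2, 1005.7),
                        stringsAsFactors = FALSE)

# Hypothetical crosswalk: every known spelling maps to one harmonized name
crosswalk <- c(wex = "wex", wextra = "wex", isocntry = "isocntry")

rename_by_crosswalk <- function(survey, crosswalk) {
  known <- names(survey) %in% names(crosswalk)
  names(survey)[known] <- unname(crosswalk[names(survey)[known]])
  survey
}

toy_waves <- lapply(list(wave_2013, wave_2019), rename_by_crosswalk,
                    crosswalk = crosswalk)
weights <- do.call(rbind, toy_waves)
print(weights)
```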

Socio-demographic variables like level of highest education or occupation are rather country-specific. Eurobarometer uses standardized occupation and marital status scales, and a proxy for education levels, age of leaving full-time education.

This is a particularly tricky variable, because its coding in fact contains three different variables: the school leaving age, a marker for students, and a marker for people who did not finish their compulsory primary school. And while school leaving age has served as a good proxy since the 1970s, in an age when the EU is promoting life-long learning it becomes less and less useful, as people stop and re-start their education throughout their lives.

example <- demography[[1]] %>%
  mutate ( across ( -any_of(c("rowid", "w1", "wex")), as_character) ) %>%
  mutate ( across (any_of(c("w1", "wex")), as_numeric) )
unique ( example$age_education )

##  [1] "22"                     "25"                     "17"                    
##  [4] "19"                     "12"                     "23"                    
##  [7] "18"                     "20"                     "21"                    
## [10] "14"                     "24"                     "16"                    
## [13] "26"                     "15"                     "Still studying"        
## [16] "DK"                     "31"                     "29"                    
## [19] "27"                     "13"                     "32"                    
## [22] "28"                     "30"                     "53"                    
## [25] "42"                     "62"                     "40"                    
## [28] "No full-time education" "Refusal"                "37"                    
## [31] "39"                     "34"                     "35"                    
## [34] "47"                     "36"                     "45"                    
## [37] "51"                     "33"                     "43"                    
## [40] "38"                     "49"                     "46"                    
## [43] "41"                     "57"                     "7"                     
## [46] "48"                     "44"                     "50"                    
## [49] "56"                     "8"                      "11"                    
## [52] "10"                     "9"                      "75 years"              
## [55] "6"                      "3"                      "54"                    
## [58] "55"                     "60"                     "64"                    
## [61] "2 years"                "58"                     "52"                    
## [64] "72"                     "61"                     "4"                     
## [67] "63"

The seemingly trivial age_exact variable has its own issues, too:

unique ( example$age_exact)

##  [1] "54"       "66"       "56"       "53"       "33"       "72"      
##  [7] "83"       "62"       "86"       "77"       "64"       "46"      
## [13] "44"       "59"       "60"       "67"       "63"       "20"      
## [19] "43"       "37"       "78"       "49"       "90"       "45"      
## [25] "28"       "29"       "30"       "39"       "51"       "38"      
## [31] "41"       "71"       "25"       "48"       "79"       "88"      
## [37] "61"       "85"       "70"       "35"       "81"       "52"      
## [43] "57"       "27"       "47"       "15 years" "21"       "42"      
## [49] "32"       "68"       "36"       "34"       "19"       "31"      
## [55] "26"       "23"       "24"       "22"       "16"       "84"      
## [61] "65"       "18"       "55"       "40"       "50"       "73"      
## [67] "69"       "87"       "89"       "74"       "75"       "98 years"
## [73] "76"       "80"       "58"       "82"       "17"       "93"      
## [79] "91"       "92"       "95"       "94"       "97"

Let’s see all the strange labels attached to age-type variables:

collect_val_labels(metadata = eb_demography_metadata %>%
                     filter ( var_name_suggested %in% c("age_exact", "age_education")) )

##  [1] "2 years"                  "75 years"                
##  [3] "No full-time education"   "Still studying"          
##  [5] "15 years"                 "98 years"                
##  [7] "96 years"                 "[NOT CLEARLY DOCUMENTED]"
##  [9] "74 years"                 "99 and older"            
## [11] "Refusal"                  "87 years"                
## [13] "DK"                       "88 years"

We must handle many exceptions, so we created a function for this purpose:

remove_years  <- function(x) { 
  x <- gsub("years|and\\solder", "", tolower(x))
  stringr::str_trim (x, "both")}

process_demography <- function (x) { 
  
  x %>% mutate ( across ( -any_of(c("rowid", "w1", "wex")), as_character) ) %>%
    mutate ( across (any_of(c("w1", "wex")), as_numeric) ) %>%
    mutate ( across (contains("age"), remove_years)) %>%
    mutate ( age_exact = as.numeric (age_exact)) %>%
    mutate ( is_student = ifelse ( tolower(age_education) == "still studying", 
                                   1, 0), 
             no_education = ifelse ( tolower(age_education) == "no full-time education", 1, 0)) %>%
    mutate ( education = case_when (
      grepl("studying", age_education) ~ age_exact, 
      grepl ("education", age_education)  ~ 14, 
      grepl ("refus|document|dk", tolower(age_education)) ~ NA_real_,
      TRUE ~ as.numeric(age_education)
    ))  %>%
    mutate ( education = case_when ( 
      education < 14 ~ NA_real_, 
      education > 30 ~ 30, 
      TRUE ~ education )) 
}

demography <- lapply ( demography, process_demography )

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion
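The recoding rules can be checked on a handful of hand-made labels. The base R sketch below mirrors the education rules of process_demography() under a hypothetical name, recode_education(); it also shows where the coercion warnings come from, because as.numeric() turns text labels into NA.

```r
recode_education <- function(age_education, age_exact) {
  label <- tolower(trimws(age_education))
  # as.numeric() on text labels yields NA with a warning, silenced here
  years <- suppressWarnings(as.numeric(label))
  studying <- grepl("studying", label)
  years[studying] <- age_exact[studying]                 # students: current age
  years[grepl("education", label)] <- 14                 # no full-time education
  years[grepl("refus|document|dk", label)] <- NA_real_   # declined or unknown
  years[!is.na(years) & years < 14] <- NA_real_          # implausibly low values
  years[!is.na(years) & years > 30] <- 30                # cap at 30
  years
}

education <- recode_education(
  age_education = c("19", "Still studying", "No full-time education",
                    "Refusal", "53", "7"),
  age_exact     = c(43, 20, 84, 51, 70, 66)
)
print(education)
```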

# We'll use a full join and not rbind, because we have different variables in different waves.
demography <- Reduce ( full_join, demography )

## Joining, by = c("rowid", "isocntry", "w1", "wex", "marital_status", "age_education", "age_exact", "occupation_of_respondent", "occupation_of_respondent_recoded", "respondent_occupation_scale_c_14", "type_of_community", "is_student", "no_education", "education")
## Joining, by = c("rowid", "isocntry", "w1", "wex", "marital_status", "age_education", "age_exact", "occupation_of_respondent", "occupation_of_respondent_recoded", "respondent_occupation_scale_c_14", "type_of_community", "is_student", "no_education", "education")
## Joining, by = c("rowid", "isocntry", "w1", "wex", "marital_status", "age_education", "age_exact", "occupation_of_respondent", "occupation_of_respondent_recoded", "respondent_occupation_scale_c_14", "type_of_community", "is_student", "no_education", "education")
## Joining, by = c("rowid", "isocntry", "w1", "wex", "marital_status", "age_education", "age_exact", "occupation_of_respondent", "occupation_of_respondent_recoded", "respondent_occupation_scale_c_14", "type_of_community", "is_student", "no_education", "education")
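
To illustrate why a full join is used instead of row binding, here is a minimal sketch with two invented toy waves that do not share all columns. It uses base R's merge(all = TRUE) as a stand-in for dplyr's full_join(), so it runs without extra packages; the data frames are assumptions for illustration only.

```r
# Two toy waves; only the second one has type_of_community (invented data).
wave_a <- data.frame(rowid = c("a_1", "a_2"), age_exact = c(43, 64))
wave_b <- data.frame(rowid = "b_1", age_exact = 30, type_of_community = "Rural")

# Reduce() joins the list pairwise; merge(all = TRUE) is a full outer join
# on the columns that the two tables have in common.
joined <- Reduce(function(x, y) merge(x, y, all = TRUE), list(wave_a, wave_b))
joined
```

Rows from the wave that lacks a variable get NA in that column, which is what we want for variables that were not asked in every wave.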

Now let’s see what we have here:

set.seed(2021)
sample_n(demography, 12)

## # A tibble: 12 x 14
##    rowid    isocntry    w1    wex marital_status        age_education  age_exact
##    <chr>    <chr>    <dbl>  <dbl> <chr>                 <chr>              <dbl>
##  1 ZA7488_~ SI       0.828  1428. (Re-)Married: withou~ 19                    43
##  2 ZA7488_~ PL       1.01  32830. (Re-)Married: withou~ 19                    64
##  3 ZA6861_~ DK       0.641  3100. (Re-)Married: withou~ 22                    78
##  4 ZA6861_~ FI       1.83   8601. (Re-)Married: childr~ 30                    38
##  5 ZA7572_~ SE       0.342  2645. (Re-)Married: withou~ 17                    68
##  6 ZA7572_~ IT       0.630 32287. (Re-)Married: childr~ 20                    40
##  7 ZA6861_~ IE       0.868  3054. (Re-)Married: childr~ 32                    42
##  8 ZA6861_~ RO       0.724 11805. (Re-)Married: withou~ 14                    59
##  9 ZA7488_~ CY       0.691  1013. (Re-)Married: childr~ 18                    67
## 10 ZA6595_~ HR       0.580  2098. Single living w part~ 27                    30
## 11 ZA7572_~ CZ       1.86  16908. Single: without chil~ still studying        20
## 12 ZA6861_~ PT       0.932  7448. Widow: with children  no full-time ~        84
## # ... with 7 more variables: occupation_of_respondent <chr>,
## #   occupation_of_respondent_recoded <chr>,
## #   respondent_occupation_scale_c_14 <chr>, type_of_community <chr>,
## #   is_student <dbl>, no_education <dbl>, education <dbl>

Harmonizing Variable Labels

So far we have been working with metadata, weights and socio-demography. In other words, we have not even started the desired harmonization of climate change awareness. The methodology is the same, but here we really must look out for the answer options in the questionnaire. (Refer to our data summary again here.)

climate_awareness_metadata <- eb_climate_metadata %>%
  suggest_var_names( survey_program = "eurobarometer" ) %>%
  filter ( .data$var_name_suggested  %in% c("rowid",
                                            "serious_world_problems_first", 
                                            "serious_world_problems_climate_change")
  ) 

hw <- harmonize_var_names ( waves = eb_waves, 
                            metadata = climate_awareness_metadata )

The retroharmonize package comes with a generic harmonize_values() function that will change the value labels of categorical variables (including binary ones) to a unified format. It will also take care of various types of missing values.

First, let’s go back to our metadata and collect all value labels that will show up with collect_val_labels():

collect_val_labels(climate_awareness_metadata)

##  [1] "Climate change"                            
##  [2] "International terrorism"                   
##  [3] "Poverty, hunger and lack of drinking water"
##  [4] "Spread of infectious diseases"             
##  [5] "The economic situation"                    
##  [6] "Proliferation of nuclear weapons"          
##  [7] "Armed conflicts"                           
##  [8] "The increasing global population"          
##  [9] "Other (SPONTANEOUS)"                       
## [10] "None (SPONTANEOUS)"                        
## [11] "Not mentioned"                             
## [12] "Mentioned"                                 
## [13] "DK"

In this case, we want to select Climate change when it was named as the single most serious problem, and when it was picked from a list of three serious problems. In the single-choice question, the alternative to Climate change is another problem, for example, Spread of infectious diseases, as we all know well by 2021. In the multiple-choice question, each problem is coded separately, so Climate change is labeled either as Mentioned or as Not mentioned.

We want to see who thought Climate change was the most serious problem, or one of the most serious problems, so we label each mention of Climate change as mentioned and pair it with a numeric value of 1. All other cases are labeled as not_mentioned, with the exception of the various missing observations, which in these cases are Do not know answers, Declined to answer cases, and Inappropriate cases. [The latter is Eurobarometer’s label for questions that were, for one reason or another, not asked of a particular interviewee, for example, because the Turkish Cypriot community received a different questionnaire.]

# positive cases
label_1 <- c("^Climate\\schange", "^Mentioned")
# missing cases 
na_labels <- collect_na_labels( climate_awareness_metadata)
na_labels

## [1] "DK"                             "Inap. (10 or 11 in qa1a)"      
## [3] "Inap. (coded 10 or 11 in qc1a)" "Inap. (coded 10 or 11 in qb1a)"

# negative cases
label_0 <- collect_val_labels( climate_awareness_metadata)
label_0 <- label_0[! label_0 %in% label_1 ]
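
One thing to keep in mind when splitting labels this way: %in% compares strings literally, while label_1 holds regular expressions. The following base R sketch, on a shortened and partly invented label vector, contrasts literal matching with regex matching via grepl(). It is only an illustration of the difference; how harmonize_values() resolves a label that matches several patterns is not shown here.

```r
# A shortened label vector and the positive patterns from the text.
all_labels <- c("Climate change", "International terrorism",
                "Not mentioned", "Mentioned", "DK")
positive_patterns <- c("^Climate\\schange", "^Mentioned")

# Literal comparison: no label is identical to a pattern string.
literal_match <- all_labels %in% positive_patterns

# Regex comparison: combine the patterns with "|" and test each label.
regex_match <- grepl(paste(positive_patterns, collapse = "|"), all_labels)

all_labels[regex_match]
```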

The harmonize_serious_problems() function harmonizes the labels within the special labeled class of retroharmonize. This class retains all the information needed to give categorical variables a character or numeric representation, plus various processing metadata for documentation purposes. While this class is very rich (it contains whatever was imported from SPSS’s proprietary data format, and the history), it is not suitable for statistical analysis. We could, of course, directly call harmonize_values() from the retroharmonize package, but the parameterization would be very complicated even in a simple function call, not to mention a looped call. Because this function is the heart of the retroharmonize package, it has a tutorial article of its own.

harmonize_serious_problems <- function(x) {
  label_list <- list(
    from = c(label_0, label_1, na_labels), 
    to = c( rep ( "not_mentioned", length(label_0) ),   # use the same order as in from!
            rep ( "mentioned", length(label_1) ),
            "do_not_know", "inap", "inap", "inap"), 
    numeric_values = c(rep ( 0, length(label_0) ), # use the same order as in from!
                       rep ( 1, length(label_1) ),
                       99997,99999,99999,99999)
  )
  
  harmonize_values(x, 
                   harmonize_labels = label_list, 
                   na_values = c("do_not_know"=99997,
                                 "declined"=99998,
                                 "inap"=99999), 
                   remove = "\\(|\\)|\\[|\\]|\\%"
  )
  )
}
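
Because from, to and numeric_values are matched by position, it helps to check that they have the same length and to view them side by side before passing them on. A minimal base R sketch, using shortened toy label vectors (assumptions for illustration, not the full label sets from the metadata):

```r
# Shortened toy versions of the three label groups.
label_0   <- c("International terrorism", "Not mentioned")
label_1   <- c("^Climate\\schange", "^Mentioned")
na_labels <- c("DK", "Inap. (10 or 11 in qa1a)")

label_list <- list(
  from = c(label_0, label_1, na_labels),
  to = c(rep("not_mentioned", length(label_0)),
         rep("mentioned", length(label_1)),
         "do_not_know", "inap"),
  numeric_values = c(rep(0, length(label_0)),
                     rep(1, length(label_1)),
                     99997, 99999)
)

# All three vectors must have the same length, in the same order.
stopifnot(length(unique(lengths(label_list))) == 1)

# A side-by-side view makes ordering mistakes easy to spot.
mapping <- as.data.frame(label_list, stringsAsFactors = FALSE)
mapping
```

Note that the hard-coded missing labels at the end of to and numeric_values must follow the order of na_labels, so they need updating whenever the collected missing labels change.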

Our objects are rather big in memory, so first, let’s remove the surveys that do not contain these world problem variables. In these cases, the subsetted and harmonized surveys in the nested list have only one column, i.e. the rowid.

hw <- hw[unlist ( lapply ( hw, ncol)) > 1 ]
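
The same filter on a toy list, as a sketch with invented data frames, shows what is kept and what is dropped. It uses vapply() instead of unlist(lapply()) only because vapply() guarantees an integer result; the behavior is otherwise the same.

```r
# Toy nested list: the first survey has only the identifier column.
surveys <- list(
  wave_1 = data.frame(rowid = c("w1_1", "w1_2")),
  wave_2 = data.frame(rowid = "w2_1", serious_world_problems_first = 1)
)

# Keep only the surveys with more than one column.
kept <- surveys[vapply(surveys, ncol, integer(1)) > 1]
names(kept)
```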

Now we have a smaller problem to deal with. With many surveys, it is easy to fill up your computer’s memory, so let’s start building up our joined panel data from a smaller set of nested, subsetted surveys.

hw <- lapply ( hw, function (x) x %>% mutate ( across ( contains("problem"), harmonize_serious_problems) ) )

Our lapply loop calls an anonymous function, which in turn calls harmonize_serious_problems(), our parameterized version of harmonize_values(), on all variables that have problem in their names.

Once we are done, our variables have harmonized names, harmonized values and harmonized labels, but they are stored in the complex retroharmonize_labelled_spss_survey class, inherited from the haven_labelled_spss class in haven.

We reduced our single and multiple choice questions to binary choice variables. We can now give them a numeric representation. Be mindful that retroharmonize has special methods for its special labeled class that retains metadata from SPSS. This means that as_character and as_numeric know how to handle various types of missing values, whereas the base R as.character and as.numeric may coerce special values to unwanted results. This is particularly dangerous with numeric variables, and this is the reason why we introduced a new set of S3 objects and methods in the package.

We will ignore the differences between various forms of missingness, i.e. the person said that she did not know, or did not want to answer, or for some reason was not asked in the survey. In a more descriptive, non-harmonized analysis you would probably want to explore them as various ‘categories’ and use a character representation.

hw <- lapply ( hw, function(x) x %>% mutate ( across ( contains("problem"), as_numeric) ))

hw <- Reduce ( full_join, hw) # we must use joins instead of binds because the number of columns varies.
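
To see why the special missing codes must not be treated as ordinary numbers, consider this base R sketch on an invented coded vector. It does not use the retroharmonize methods; it only mimics their effect by turning the codes defined above (99997, 99998, 99999) into NA before averaging.

```r
# Toy coded answers: 0 = not mentioned, 1 = mentioned, plus missing codes.
coded <- c(0, 1, 1, 99997, 99999)

# Naive average: the missing codes dominate the result.
naive_mean <- mean(coded)

# Treat the special codes as missing before computing the share.
na_codes <- c(99997, 99998, 99999)
cleaned <- ifelse(coded %in% na_codes, NA_real_, coded)
share_mentioned <- mean(cleaned, na.rm = TRUE)

c(naive = naive_mean, cleaned = share_mentioned)
```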

Let’s see what we have:

set.seed(2021)
sample_n (hw, 12)

## # A tibble: 12 x 3
##    rowid             serious_world_problems_fi~ serious_world_problems_climate_~
##    <chr>                                  <dbl>                            <dbl>
##  1 ZA6595_v3-0-0_23~                          0                               NA
##  2 ZA7572_v1-0-0_70~                          0                                0
##  3 ZA6595_v3-0-0_18~                          0                               NA
##  4 ZA6861_v1-2-0_27~                          0                                0
##  5 ZA6595_v3-0-0_26~                          0                               NA
##  6 ZA7572_v1-0-0_19~                          0                                1
##  7 ZA5877_v2-0-0_16~                          0                                0
##  8 ZA6861_v1-2-0_12~                          0                                0
##  9 ZA7572_v1-0-0_17~                          0                                0
## 10 ZA5877_v2-0-0_17~                          0                                1
## 11 ZA6861_v1-2-0_41~                          0                                0
## 12 ZA6861_v1-2-0_61~                          0                                1

Creating the Longitudinal Table

Now we just need to join the partial tables together by the rowid:

# start from the smallest (we removed the surveys that had no relevant questionnaire item)
panel <- hw %>%
  left_join ( geography, by = 'rowid' ) 

panel <- panel %>%
  left_join ( demography, by = c("rowid", "isocntry") ) 

panel <- panel %>%
  left_join ( interview_dates, by = 'rowid' )
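
A left join keeps every row of the left table, so the panel should have exactly as many rows as hw, provided that the join key is unique in the right-hand table. A minimal sketch of that check on invented toy tables, with base R's merge(all.x = TRUE) standing in for left_join():

```r
# Toy tables; r3 has no geographical information (invented data).
hw_toy <- data.frame(rowid = c("r1", "r2", "r3"), climate_change = c(0, 1, NA))
geography_toy <- data.frame(rowid = c("r1", "r2"), geo = c("ES41", "RO31"))

# A join key duplicated on the right-hand side would multiply rows.
stopifnot(anyDuplicated(geography_toy$rowid) == 0)

# all.x = TRUE keeps every row of the left table, like left_join().
panel_toy <- merge(hw_toy, geography_toy, by = "rowid", all.x = TRUE)
nrow(panel_toy) == nrow(hw_toy)
```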

And let’s see a small sample:

sample_n(panel, 12)

## # A tibble: 12 x 19
##    rowid  serious_world_pr~ serious_world_pr~ isocntry geo   region    w1    wex
##    <chr>              <dbl>             <dbl> <chr>    <chr> <chr>  <dbl>  <dbl>
##  1 ZA686~                 0                 0 ES       ES41  Casti~ 1.21  46787.
##  2 ZA686~                 0                 0 RO       RO31  South~ 0.724 11805.
##  3 ZA686~                 0                 0 SK       SK02  Zapad~ 0.774  3499.
##  4 ZA757~                 0                 1 PT       PT16  Centr~ 1.11   9336.
##  5 ZA659~                 1                NA HR       HR041 Grad ~ 0.580  2098.
##  6 ZA659~                 1                NA RO       RO21  North~ 1.21  20160.
##  7 ZA686~                 0                 0 PT       PT17  Lisboa 0.932  7448.
##  8 ZA659~                 0                NA GB-GBN   UKI   London 0.994 50133.
##  9 ZA757~                 0                 0 CY       CY    REPUB~ 0.594   874.
## 10 ZA686~                 0                 0 LT       LT003 Klaip~ 0.623  1564.
## 11 ZA757~                 0                 0 IE       IE013 West ~ 0.490  1651.
## 12 ZA659~                 0                NA LT       LT003 Klaip~ 1.16   2917.
## # ... with 11 more variables: marital_status <chr>, age_education <chr>,
## #   age_exact <dbl>, occupation_of_respondent <chr>,
## #   occupation_of_respondent_recoded <chr>,
## #   respondent_occupation_scale_c_14 <chr>, type_of_community <chr>,
## #   is_student <dbl>, no_education <dbl>, education <dbl>,
## #   date_of_interview <date>

saveRDS ( panel, file.path(tempdir(), "climate_panel.rds"), version = 2)

# not evaluated
saveRDS( panel, file = file.path("data-raw", "climate-panel.rds"), version = 2)
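
A quick round trip confirms that the saved file restores the same object. This sketch uses an invented toy data frame and a temporary file; version = 2 writes the older serialization format, which can also be read by R versions before 3.5.0.

```r
# Toy stand-in for the panel (invented data).
toy_panel <- data.frame(rowid = c("r1", "r2"), climate_change = c(0, 1))

# Save to a temporary file and read it back.
toy_path <- file.path(tempdir(), "toy_panel.rds")
saveRDS(toy_panel, file = toy_path, version = 2)
restored <- readRDS(toy_path)

identical(toy_panel, restored)
```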

Putting It on a Map

This is not the end of the story. If you put all this on a map, the results are a bit disappointing.

Why? Because sub-national (provincial, state, county, district, parish) borders are changing all the time, within the EU and everywhere else. The next step is to harmonize the geographical information. We have another package released on CRAN to help you with this. See the next post: Regional Climate Change Awareness Dataset.

What is Retrospective Survey Harmonization?

Thu, 04 Mar 2021 00:00:00 +0000

Reproducible ex post harmonization of survey microdata

Retrospective survey harmonization allows the comparison of opinion poll data collected in different countries or at different times. In this example we are working with data from surveys that were ex ante harmonized to a certain degree: in our tutorials we are choosing questions that were asked in the same way in many natural languages. For example, you can compare what percentage of the European people in various countries, provinces and regions thought climate change was a serious world problem back in 2013, 2015, 2017 and 2019.

We developed the retroharmonize R package to help this process. We have tested the package extensively with about 80 Eurobarometer and 5 Afrobarometer survey files, and to a lesser extent with Arab Barometer files. This allows the comparison of various survey answers in about 70 countries. These policy-oriented survey programs were designed to be harmonized to a certain degree, but their ex post harmonization is still necessary, challenging and error-prone. Retrospective harmonization includes harmonizing the different codings used for questions and answer options, the post-stratification weights, and the different file formats.

Eurobarometer, Afrobarometer, Arab Barometer and Latinobarómetro make survey files that are harmonized across countries available for research under various terms. Our retroharmonize package is not affiliated with them, and to run our examples, you must visit their websites, carefully read their terms, agree to them, and download their data yourself. The value we add is that we help to connect their files across time (from different years) or across these programs.

The survey programs mentioned above publish their data in the proprietary SPSS format. This file format can be imported and translated to R objects with the haven package; however, we needed to re-design haven’s labelled_spss class to maintain far more metadata. This class is, in turn, a modification of the labelled class. The haven package was designed and tested with data stored in individual SPSS files.

The author of labelled, Joseph Larmarange, describes two main approaches to working with labelled data, such as SPSS’s method of storing categorical data, in the Introduction to labelled.

Two main approaches of labelled data conversion.

Our approach is a further extension of Approach B. Survey harmonization in our case always means joining data from several SPSS files, which requires consistent coding among several data sources. This means that data cleaning and recoding must take place before conversion to factors, character or numeric vectors. This is particularly important with factor data (and their simple character conversions) and numeric data that occasionally contain labels, for example, to describe the reason why certain data is missing. Our tutorial vignette labelled_spss_survey gives you more information about this.

In the next series of tutorials, we will deal with an array of problems. These are not for the faint of heart: you need to have a solid intermediate level of R to follow.

Tidy, joined survey data

The original file identifiers may not be unique, so we have to create new, truly unique identifiers. Weighting may not be straightforward.
Neither the number of observations nor the number of variables (which represent the survey questions and their translation to coded data) is the same. Certain data may be present in only one survey and not in the other. This means that you will likely run loops on lists and not on data.frames, but eventually you must carefully join them.
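
As a sketch of the identifier problem, the following base R example builds unique row identifiers by prefixing each respondent id with the survey file identifier. The helper add_rowid() and the column name uniqid are hypothetical and used only for illustration; they are not the retroharmonize API.

```r
# Two toy survey files whose original respondent ids overlap (invented data).
wave_2013 <- data.frame(uniqid = c(1, 2, 3))
wave_2015 <- data.frame(uniqid = c(1, 2))

# Hypothetical helper: prefix the id with the survey file identifier.
add_rowid <- function(survey, survey_id) {
  survey$rowid <- paste(survey_id, survey$uniqid, sep = "_")
  survey
}

# Loop over a list of surveys, not over a single data.frame.
waves <- list(
  add_rowid(wave_2013, "ZA5877"),
  add_rowid(wave_2015, "ZA6595")
)

# The new identifiers are unique across all files.
all_ids <- unlist(lapply(waves, function(x) x$rowid))
anyDuplicated(all_ids) == 0
```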

Class conversion

Similar questions may be imported from a non-native R format, in our case from SPSS files, in an inconsistent manner. SPSS’s variable formats cannot be translated unambiguously to R classes. retroharmonize introduced a new S3 class system that handles this problem, but eventually you will have to choose whether you want to see a numeric or character coding of each categorical variable.
The harmonized surveys, with harmonized variable names and harmonized value labels, must be brought to consistent R representations (most statistical functions will only work on numeric, factor or character data) and carefully joined into a single data table for analysis.

Harmonization of variables and variable labels

The same variables may come with dissimilar variable names and variable labels. It may be a challenge to match age with age. We need to harmonize the names of variables.
The harmonized variables may have different labeling. One survey may label refused answers as declined and another as refusal. In a simple choice question, climate change may be ‘Climate change’ or ‘Problem: Climate change’. Binary choices may have survey-specific coding conventions. Value labels must be harmonized. There are good tools to do this in a single file, but we have to work with several of them.

Missing value harmonization

There are likely to be various types of missing values. Working with missing values is probably where most human judgment is needed. Why are some answers missing: was the question not asked in some questionnaires? Is there a coding error? Did the respondent refuse the question, or say that she did not have an answer? retroharmonize has a special labeled vector type that retains this information from the raw data, if it is present, but you must make the judgment yourself. In R, eventually you will either create a missing category, or use NA_character_ or NA_real_.
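
The two options can be sketched in base R on an invented answer vector. The label names below are assumptions for illustration; real surveys use their own wording for each type of missingness.

```r
# Toy answers with three kinds of missingness (invented labels).
raw <- c("Mentioned", "Not mentioned", "DK", "Refusal", "Inap.")
missing_labels <- c("DK", "Refusal", "Inap.")

# Option 1: keep missingness as an explicit category.
as_category <- ifelse(raw %in% missing_labels, "missing", raw)

# Option 2: collapse all kinds of missingness into NA_character_.
as_na <- ifelse(raw %in% missing_labels, NA_character_, raw)

table(as_category)
sum(is.na(as_na))
```

Option 1 keeps the missing cases visible in tabulations; Option 2 lets functions with na.rm = TRUE skip them.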

That’s a lot to put on your plate.

It is unlikely that you will be able to work with completely unfamiliar survey programs if you do not have a strong intermediate level of R. Our package comes with tutorials for Eurobarometer and Afrobarometer, and our development version already covers Arab Barometer. The tutorials highlight some peculiar issues with these survey programs, which we hope will give a head start to less experienced R users.

Eurobarometer Surveys Used In Our Project

Wed, 03 Mar 2021 00:00:00 +0000

In our tutorial series, we are going to harmonize the following questionnaire items from five Eurobarometer harmonized survey files. The Eurobarometer survey files are harmonized across countries, but they are only partially harmonized in time.

All data must be downloaded from the GESIS Data Archive in Cologne. We are not affiliated with GESIS and you must read and accept their terms to use the data.

Eurobarometer 80.2 (2013)

GESIS Data Archive, Cologne. ZA5877 Data file Version 2.0.0, https://doi.org/10.4232/1.12792

Data file: ZA5877 data file (European Commission 2017).
Questionnaire: Eurobarometer 80.2 Basic Bilingual Questionnaire
Citation: ZA5877 Bibtex

QA1a Which of the following do you consider to be the single most serious problem facing the world as a whole? (single choice)

QA1b Which others do you consider to be serious problems? (multiple choice)

QA2 And how serious a problem do you think climate change is at this moment? Please use a scale from 1 to 10, with '1' meaning it is "not at all a serious problem" (scale 1-10)

QA4 To what extent do you agree or disagree with each of the following statements? - Fighting climate change and using energy more efficiently can boost the economy and jobs in the EU (agreement-disagreement 4-scale)

QA4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU could benefit the EU economically (agreement-disagreement 4-scale)

QA5 Have you personally taken any action to fight climate change over the past six months? (binary)

Eurobarometer 83.4 (2015)

European Commission, Brussels; Directorate General Communication COMM.A.1 ‘Strategy, Corporate Communication Actions and Eurobarometer’ GESIS Data Archive, Cologne. ZA6595 Data file Version 3.0.0, https://doi.org/10.4232/1.13146

Data file: ZA6595 data file (European Commission 2018).
Questionnaire: Eurobarometer 83.4 Basic Bilingual Questionnaire
Citation: ZA6595 Bibtex

Eurobarometer 87.1 (2017)

European Commission, Brussels; Directorate General Communication, COMM.A.1 ‘Strategic Communication’; European Parliament, Directorate-General for Communication, Public Opinion Monitoring Unit GESIS Data Archive, Cologne. ZA6861 Data file Version 1.2.0, https://doi.org/10.4232/1.12922

Data file: ZA6861 data file.
Questionnaire: Eurobarometer 87.1 Basic Bilingual Questionnaire
Citation: ZA6861 Bibtex

QC1a Which of the following do you consider to be the single most serious problem facing the world as a whole? (single choice)

QC1b Which others do you consider to be serious problems? (multiple choice)

QC2 And how serious a problem do you think climate change is at this moment? Please use a scale from 1 to 10, with '1' meaning it is "not at all a serious problem" (scale 1-10)

QC4 To what extent do you agree or disagree with each of the following statements? - Fighting climate change and using energy more efficiently can boost the economy and jobs in the EU (agreement-disagreement 4-scale)

QC4 To what extent do you agree or disagree with each of the following statements? - Promoting EU expertise in new clean technologies to countries outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QC4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QC4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can increase the security of EU energy supplies (agreement-disagreement 4-scale)

QC4 To what extent do you agree or disagree with each of the following statements? - More public financial support should be given to the transition to clean energies even if it means subsidies to fossil fuels should be reduced. (agreement-disagreement 4-scale)

QC5 Have you personally taken any action to fight climate change over the past six months? (binary)

Eurobarometer 90.2 (2018)

European Commission, Brussels; Directorate General Communication, COMM.A.3 ‘Media Monitoring and Eurobarometer’ GESIS Data Archive, Cologne. ZA7488 Data file Version 1.0.0, https://doi.org/10.4232/1.13289

Data file: ZA7488 data file (European Commission 2019a)
Questionnaire: Eurobarometer 90.2 Basic Bilingual Questionnaire
Citation: ZA7488 Bibtex

QB5 To what extent do you agree or disagree with each of the following statements? - Fighting climate change and using energy more efficiently can boost the economy and jobs in the EU (agreement-disagreement 4-scale)

QB5 To what extent do you agree or disagree with each of the following statements? - Promoting EU expertise in new clean technologies to countries outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QB5 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QB5 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can increase the security of EU energy supplies (agreement-disagreement 4-scale)

QB5 To what extent do you agree or disagree with each of the following statements? - More public financial support should be given to the transition to clean energies even if it means subsidies to fossil fuels should be reduced. (agreement-disagreement 4-scale)

Eurobarometer 91.3 (2019)

European Commission, Brussels; Directorate General Communication, COMM.A.3 ‘Media Monitoring and Eurobarometer’ GESIS Data Archive, Cologne. ZA7572 Data file Version 1.0.0, https://doi.org/10.4232/1.13372

Data file: ZA7572 data file (European Commission 2019b).
Questionnaire: Eurobarometer 91.3 Basic Bilingual Questionnaire
Citation: ZA7572 Bibtex

QB4 To what extent do you agree or disagree with each of the following statements? - Taking action on climate change will lead to innovation that will make EU companies more competitive (N) (agreement-disagreement 4-scale)

QB4 To what extent do you agree or disagree with each of the following statements? - Promoting EU expertise in new clean technologies to countries outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QB4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QB4 To what extent do you agree or disagree with each of the following statements? - Adapting to the adverse impacts of climate change can have positive outcomes for citizens in the EU (agreement-disagreement 4-scale)

QB5 Have you personally taken any action to fight climate change over the past six months? (binary)

References

European Commission, Brussels. 2017. “Eurobarometer 80.2 (2013).” GESIS Data Archive, Cologne. ZA5877 Data file Version 2.0.0. https://doi.org/10.4232/1.12792.

———. 2018. “Eurobarometer 83.4 (2015).” GESIS Data Archive, Cologne. ZA6595 Data file Version 3.0.0. https://doi.org/10.4232/1.13146.

———. 2019a. “Eurobarometer 90.2 (2018).” GESIS Data Archive, Cologne. ZA7488 Data file Version 1.0.0. https://doi.org/10.4232/1.13289.

———. 2019b. “Eurobarometer 91.3 (2019).” GESIS Data Archive, Cologne. ZA7572 Data file Version 1.0.0. https://doi.org/10.4232/1.13372.