R | Green Deal Data Observatory

Learn R with Reprex

Fri, 07 Oct 2022 12:35:00 +0200

Big Data Creates Inequalities

Only the largest corporations, best-endowed universities, and rich governments can afford data collection and processing capacities that are large enough to harness the advantages of AI.

Big data that works for all

No matter how big the problem is or how small your team is, `Reprex` fills your reports, dashboards, newsletters, and books with data and its visualization.
Learn R with us: you can reduce the inequalities by joining the open source movement, learning to run open source software, asking for help, improving the tutorials and the documentation, and eventually learning to make the computer work for you.
Contributor Covenant: Participating in open source is often a highly collaborative experience. We’re encouraged to create in public view, and we’re incentivized to welcome contributions of all kinds from people around the world. This makes the practice of open source as much social as it is technical.

Get Inspired

Find more interesting and better data: you don’t have to be a data scientist or write code to contribute to our projects.
Data feminism: Catherine D’Ignazio and Lauren Klein present a new way of thinking about data science and data ethics—one that is informed by intersectional feminist thought. Highly inspirational, free, open-source book.
RLadies is a world-wide organization to promote gender diversity in the R community.

Contributor Covenant

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

Run code from tutorials

retroharmonize.dataobservatory.eu
🖱 Get started
🖱️ Articles

Find help, ask for help: reprex

Documentation for better tutorials

Debugging and testing code

Contribute to documentation

R is a functional language

R is both a statistical environment and a programming language
R, the open source and further developed version of the S language, is mainly functional
If you have done a task at least twice, the third time you had better write a function so that you can keep doing it forever.
Most of your effort will be to find a well-written function for your work
If you cannot find a function, you will modify somebody else’s function, or eventually write your own
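
To illustrate the "third time, write a function" rule, here is a minimal base R sketch. The helper name share_of() and the toy answers are invented for this example; they do not come from any of our packages.

```r
# Hypothetical helper: the share (in percent) of answers equal to `value`.
share_of <- function(x, value, digits = 1) {
  x <- x[!is.na(x)]                      # drop missing answers first
  round(100 * mean(x == value), digits)  # proportion -> percent
}

answers <- c("yes", "no", "yes", NA, "yes")
share_of(answers, "yes")
share_of(answers, "no")
```

Once the calculation lives in a function, the same call works for every new question or survey, and a bug has to be fixed in only one place.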

R + YAML + markdown = web ready

Learn YAML in Y minutes: tell the computer what you want to do with a document
R Markdown basics: it is just a plain markdown that allows you to insert little R program ‘chunks’.
Awesome markdown editors and pre-writers: find a convenient tool
Google Docs to markdown: practice by translating your Google Docs text to markdown. It is very easy.

Package and release: a team effort

Our open source development projects

🔢 dataset: Synchronize datasets with global knowledge hubs
#️⃣ statcodelists: Make your data codes understood globally
♻️ iotables: Create economic or environmental impact assessments in any EU country
🌍 regions: Create more granular statistics from raw survey data in any EU country
✅ retroharmonize: Harmonize question banks, recycle answers from past surveys
⏭️ all on one page

Create with us

Questions?

Email | Keybase

LinkedIn: Daniel Antal - Reprex | Home

statcodelists: use standard, language-independent variable codes to help international data interoperability and machine reuse in R

Wed, 29 Jun 2022 08:12:00 +0100

Visit the documentation website of statcodelists on statcodelists.dataobservatory.eu/.

The goal of statcodelists is to promote the reuse and exchange of statistical information and related metadata by making the internationally standardized SDMX code lists available to R users. SDMX – the Statistical Data and Metadata eXchange – has been published as an ISO International Standard (ISO 17369). The metadata definitions, including the code lists, are updated regularly according to the standard. The authoritative version of the code lists made available in this package is https://sdmx.org/?page_id=3215/.

Purpose

Cross-domain concepts in the SDMX framework describe concepts relevant to many, if not all, statistical domains. SDMX recommends using these concepts whenever feasible in SDMX structures and messages to promote the reuse and exchange of statistical information and related metadata between organisations.

Code lists are predefined sets of terms from which some statistical coded concepts take their values. SDMX cross-domain code lists are used to support cross-domain concepts. What are these cross-domain coded concepts?

Geographical codes, like NL: the Netherlands in the CL_AREA code list.
Standard industry codes, like J631 for Data processing, hosting and related activities (NACE Rev 2 in Europe; beware, it is J592 in Australia and New Zealand, see CL_ACTIVITY_ANZSIC06).
Occupations, like OC2521 for Database designers and administrators in CL_OCCUPATIONS.
Time formatting standards, like CCYY for annual data series in CL_TIME_FORMAT.

Check out the available code lists on the package homepage.

The use of common code lists will help users to work even more efficiently, easing the maintenance of, and reducing the need for, mapping systems and interfaces delivering data and metadata to them. A very obvious advantage of using the code systems is that you can retrieve data from national sources regardless of the natural language used in North Macedonia, Japan, the U.S. or the Netherlands. While the data labels may change to be locally human-readable, computers and geeks can read the codes and understand them immediately, provided that the sources use the standard codes.
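
A minimal base R sketch of why codes beat labels when joining data from different sources. The two toy tables and their values are invented; only the idea of a shared two-letter area code, like the NL example above, is taken from the text.

```r
# Toy national tables: the labels are local, the area codes are shared.
dutch   <- data.frame(geo = c("NL", "JP"),
                      label = c("Nederland", "Japan"),
                      value = c(10, 20))
english <- data.frame(geo = c("JP", "NL"),
                      label = c("Japan", "Netherlands"),
                      value_2 = c(200, 100))

# Joining on the human-readable label silently loses the Netherlands ...
by_label <- merge(dutch, english, by = "label")
# ... while joining on the language-independent code keeps every row.
by_code  <- merge(dutch[, c("geo", "value")],
                  english[, c("geo", "value_2")], by = "geo")

nrow(by_label)
nrow(by_code)
```

The join on labels fails quietly: no error is raised, a country simply disappears from the result.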

Our data observatories are rolling out SDMX coding across all datasets to help data ingestion and interoperability, data findability and data reuse. statcodelists can help the use of standard SDMX codes in your R workflow–both for downloading data from statistical agencies and to produce publication-ready datasets that the rest of the world (and even APIs) will understand.

Installation

You can install statcodelists from CRAN:

install.packages("statcodelists")

Further recommended code values for expressing general statistical concepts like not applicable, etc., can be found in section Generic codes of the Guidelines for the creation and management of SDMX Cross-Domain Code Lists.

For further code lists used by reliable statistical agencies but not harmonized on SDMX level, please consult the SDMX Global Registry Codelists page.

The creator of this package is not affiliated with SDMX, and this package has not been endorsed by SDMX.

Code of Conduct

Please note that the statcodelists project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Including Indicators from Arab Barometer in Our Observatory

Mon, 28 Jun 2021 09:00:00 +0000

A new version of the retroharmonize R package – which supports the retrospective, ex post harmonization of survey data – was released yesterday on CRAN after peer review. It allows us to compare opinion polling data from the Arab Barometer with the Eurobarometer and Afrobarometer. This is the first version that is released in the rOpenGov community, a community of R package developers on open government data analytics and related topics.

Surveys are the most important data sources in social and economic statistics – they ask people about their lives, their attitudes and self-reported actions, or record data from companies and NGOs. Survey harmonization makes survey data comparable across time and countries. It is very important, because often we do not know without comparison if an indicator value is low or high. If 40% of the people think that climate change is a very serious problem, it does not really tell us much without knowing what percentage of the people answered this question similarly a year ago, or in other parts of the world.

With the help of Ahmed Shaibani and Yousef Ibrahim, we created a third case study, after Eurobarometer and Afrobarometer, about working with the Arab Barometer harmonized survey data files.

Ex ante survey harmonization means that researchers design questionnaires that ask the same questions with the same survey methodology at repeated, distinct times (waves), or across different countries with carefully harmonized question translations. Ex post harmonization means that the resulting data has the same variable names and the same variable coding, and can be joined into a tidy data frame for joint statistical analysis. While seemingly a simple task, it involves plenty of metadata adjustments, because established survey programs like Eurobarometer, Afrobarometer or Arab Barometer have several decades of history, and several decades of coding practices and file formatting legacy.

Variable harmonization means that if the same question is called Q108 in one microdata source and eval-parl-elections in the other, then we make sure that both get a harmonized, machine-readable name without spaces and special characters.
Variable label harmonization means that the same questionnaire items get the same numeric coding and same categorical labels.
Missing case harmonization means that various forms of missingness are treated the same way.
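
The first of these steps can be sketched in base R as follows. This is an illustration only, not the retroharmonize implementation: the crosswalk, the helper harmonize_names() and the two toy surveys are invented, and only the two original variable names come from the text above.

```r
# Hypothetical crosswalk: original variable name -> harmonized name.
crosswalk <- c("Q108"                = "eval_parl_elections",
               "eval-parl-elections" = "eval_parl_elections")

harmonize_names <- function(df, crosswalk) {
  known <- names(df) %in% names(crosswalk)
  names(df)[known] <- unname(crosswalk[names(df)[known]])
  df
}

# Two toy surveys that ask the same question under different names.
survey_a <- data.frame(Q108 = c(1, 4), check.names = FALSE)
survey_b <- data.frame(`eval-parl-elections` = c(2, 3), check.names = FALSE)

harmonized <- lapply(list(survey_a, survey_b), harmonize_names,
                     crosswalk = crosswalk)
joined <- do.call(rbind, harmonized)  # possible only after renaming
names(joined)
```

Without the renaming step, rbind() would refuse to stack the two tables because their column names differ.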

For the climate awareness dataset get the country averages and aggregates from Zenodo, and the plot in jpg or png from figshare.

In our new Arab Barometer case study, the evaluation of parliamentary elections has the following labels. We code them consistently 1: free_and_fair, 2: some_minor_problems, 3: some_major_problems and 4: not_free.

“0. missing”	“1. they were completely free and fair”
“2. they were free and fair, with some minor problems”	“3. they were free and fair, with some major problems”
“4. they were not free and fair”	“8. i don’t know”
“9. declined to answer”	“Missing”
“They were completely free and fair”	“They were free and fair, with some minor breaches”
“They were free and fair, with some major breaches”	“They were not free and fair”
“Don’t know”	“Refuse”
“Completely free and fair”	“Free and fair, but with minor problems”
“Free and fair, with major problems”	“Not free or fair”
“Don’t know (Do not read)”	“Decline to answer (Do not read)”
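
One way to bring such labels to the four harmonized codes is simple pattern matching. The sketch below is a base R illustration under our own assumptions, not the code of the case study; the helper name harmonize_election_labels() is invented, and the example labels are a subset of the list above.

```r
harmonize_election_labels <- function(x) {
  x <- tolower(x)
  out <- rep(NA_character_, length(x))
  out[grepl("completely free", x, fixed = TRUE)] <- "free_and_fair"
  out[grepl("minor", x, fixed = TRUE)]           <- "some_minor_problems"
  out[grepl("major", x, fixed = TRUE)]           <- "some_major_problems"
  out[grepl("not free", x, fixed = TRUE)]        <- "not_free"
  # The factor levels fix the numeric coding 1-4 used in the text.
  factor(out, levels = c("free_and_fair", "some_minor_problems",
                         "some_major_problems", "not_free"))
}

labels_seen <- c("1. they were completely free and fair",
                 "They were free and fair, with some minor breaches",
                 "Free and fair, with major problems",
                 "Not free or fair",
                 "Don't know (Do not read)")

harmonized <- harmonize_election_labels(labels_seen)
as.integer(harmonized)
```

This sketch sends every non-substantive answer to NA; telling the different kinds of missing answers apart is a separate harmonization step.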

Of course, this harmonization is essential to get clean results like this:

For evaluation or reuse of parliamentary elections dataset get the replication data and the code from the Zenodo open repository.

In our case study, we had three forms of missingness: the respondent did not know the answer, the respondent did not want to answer, and lastly, in some cases the respondent was not asked, because the country held no parliamentary elections. While in numerical processing all these answers must be left out when calculating averages, for example, in a more detailed, categorical analysis they represent very different cases. A high level of refusal to answer may in itself be an indicator of the suppression of democratic opinion forming.
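
The difference between the two treatments can be shown with a toy vector in base R. The answer values and the three missing-case labels (do_not_know, declined, inap) are invented for this sketch.

```r
# Toy answers on the 1-4 scale, plus three hypothetical missing-case labels.
answers <- c("1", "2", "do_not_know", "4", "declined", "inap", "2")

# Numeric view: every form of missingness becomes NA and is left out.
numeric_value <- suppressWarnings(as.numeric(answers))
mean(numeric_value, na.rm = TRUE)

# Categorical view: the reason for missingness is kept and can be counted.
missing_type <- ifelse(is.na(numeric_value), answers, "valid")
table(missing_type)
```

Keeping both views side by side lets you compute averages correctly while still being able to report, for example, the share of refusals per country.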

Survey harmonization with many countries entails tens of thousands of small data management tasks, which, unless automatically documented, logged, and created with reproducible code, make for a hopelessly error-prone process. We believe that our open-source software will bring to light much new statistical information, which, while legally open, was never processed due to the large investment needed.

We also started building experimental APIs that serve data produced by running retroharmonize regularly. We will place cultural access and participation data in the Digital Music Observatory, climate awareness, policy support and self-reported mitigation strategies into the Green Deal Data Observatory, and economy and well-being data into our Economy Data Observatory.

Further plans

Retrospective survey harmonization is a far more complex task than this blogpost suggests, because established survey programs have gathered decades of legacy data in legacy coding schemes and legacy file formats. Putting the data right, and especially putting the invaluable descriptive and administrative (processing) metadata right, is a huge undertaking. We are releasing example code, datasets and charts for researchers to compare our harmonized results with theirs, and to improve our software.

Use our software

The retroharmonize R package can be freely used, modified and distributed under the GPL-3 license. For the main developer and contributors, see the package homepage. If you use it for your work, please kindly cite it as:

Daniel Antal (2021). retroharmonize: Ex Post Survey Data Harmonization. R package version 0.1.17. https://doi.org/10.5281/zenodo.5034752

Download the BibLaTeX entry.

Tutorial to work with the Arab Barometer survey data

Daniel Antal, & Ahmed Shaibani. (2021, June 26). Case Study: Working With Arab Barometer Surveys for the retroharmonize R package (Version 0.1.6). Zenodo. https://doi.org/10.5281/zenodo.5034759

For the replication data to report potential issues and improvement suggestions with the code:

Daniel Antal, & Ahmed Shaibani. (2021). Replication Data for the retroharmonize R Package Case Study: Working With Arab Barometer Surveys (Version 0.1.6) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5034741

Experimental API

We are also experimenting with the automated placement of authoritative and citeable figures and datasets in open repositories. For the climate awareness dataset get the country averages and aggregates from Zenodo, and the plot in jpg or png from figshare. Our plan is to release open data in a modern API with rich descriptive metadata meeting the Dublin Core and DataCite standards, and further administrative metadata for correct coding, joining and further manipulating our data, or for easy import into your database.

Join our open source effort

Want to help us improve our open data service? Include Latinobarómetro and the Caucasus Barometer in our offering? Join the rOpenGov community of R package developers, and our open collaboration to create the automated data observatories. We are not only looking for developers, but data curators and service design associates, too.

Open Data - The New Gold Without the Rush

Fri, 18 Jun 2021 17:00:00 +0000

If open data is the new gold, why do even those who release it fail to reuse it? We created an open collaboration of data curators and open-source developers to dig into novel open data sources and/or increase the usability of existing ones. We transform reproducible research software into research-as-service.

Every year, the EU announces that billions and billions of data are now “open” again, but this is not gold. At least not in the form of nicely minted gold coins, but in gold dust and nuggets found in the muddy banks of chilly rivers. There is no rush for it, because panning out its value requires a lot of hours of hard work. Our goal is to automate this work to make open data usable at scale, even in trustworthy AI solutions.

Most open data is not public: it is not downloadable from the Internet. In EU parlance, “open” only means a legal entitlement to get access to it. And even in the rare cases when data is open and public, it is often marred by data quality issues. We are working on the prototypes of a data-as-service and research-as-service built with open-source statistical software that taps into various and often neglected open data sources.

We are in the prototype phase in June and our intention is to have a well-functioning service by the time of the conference. Because we are working only with open-source software elements, our technological readiness level is already very high. The novelty of our process is that we are trying to further develop and integrate a few open-source technology items into technologically and financially sustainable data-as-service and even research-as-service solutions.

Our review of about 80 EU, UN and OECD data observatories reveals that most of them do not use these organizations’ open data; instead they use various, and often not well processed, proprietary sources.

We are taking a new and modern approach to the data observatory concept, and modernizing it with the application of 21st century data and metadata standards, the new results of reproducible research and data science. Various UN and OECD bodies, and particularly the European Union, support or maintain more than 60 data observatories, or permanent data collection and dissemination points, but even these do not use the open data of these organizations and their members. We are building open-source data observatories, which run open-source statistical software that automatically processes and documents reusable public sector data (from public transport, meteorology, tax offices, taxpayer funded satellite systems, etc.) and reusable scientific data (from EU taxpayer funded research) into new, high quality statistical indicators.

We are building various open-source data collection tools in R and Python to bring up data from big data APIs and legally open, but not public, and not well served data sources. For example, we are working on capturing representative data from the Spotify API or creating harmonized datasets from the Eurobarometer and Afrobarometer survey programs.
Open data is usually not public; whatever is legally accessible is usually not ready to use for commercial or scientific purposes. In Europe, almost all taxpayer funded data is legally open for reuse, but it is usually stored in heterogeneous formats, processed for an original government or scientific need, and documented to varying and often low standards. Our expert data curators are looking for new data sources that should be (re-) processed and re-documented to be usable for a wider community. We would like to introduce our service flow, which touches upon many important aspects of data scientist, data engineer and data curatorial work.
We believe that even such generally trusted data sources as Eurostat often need to be reprocessed, because various legal and political constraints do not allow the common European statistical services to provide optimal quality data – for example, on the regional and city levels.
With rOpenGov and other partners, we are creating open-source statistical software in R to re-process these heterogeneous and low-quality data into tidy statistical indicators, and to automatically validate and document them.
We are carefully documenting and releasing administrative, processing, and descriptive metadata, following international metadata standards, to make our data easy to find and easy to use for data analysts.
We are automatically creating depositions and authoritative copies marked with an individual digital object identifier (DOI) to maintain data integrity.
We are building simple databases and supporting APIs that release the data without restrictions, in a tidy format that is easy to join with other data, or easy to join into databases, together with standardized metadata.
We maintain observatory websites (see: Digital Music Observatory, Green Deal Data Observatory, Economy Data Observatory) where not only the data is available, but we provide tutorials and use cases to make it easier to use them. Our mission is to show a modern, 21st century reimagination of the data observatory concept developed and supported by the UN, EU and OECD, and we want to show that modern reproducible research and open data could make the existing 60 data observatories and the planned new ones grow faster into data ecosystems.

We are working around the open collaboration concept, which is well-known in open source software development and reproducible science, but we try to make this agile project management methodology more inclusive, and include data curators, and various institutional partners into this approach. Based around our early-stage startup, Reprex, and the open-source developer community rOpenGov, we are working together with other developers, data scientists, and domain specific data experts in climate change and mitigation, antitrust and innovation policies, and various aspects of the music and film industry.

Our open collaboration is truly open: new data curators, developers and service designers, even volunteers and citizen scientists are welcome to join.

Our open collaboration is truly open: new data curators, data scientists and data engineers are welcome to join. We develop open-source software in an agile way, so you can join in with an intermediate programming skill to build unit tests or add new functionality, and if you are a beginner, you can start with documentation and testing our tutorials. For business, policy, and scientific data analysts, we provide unexploited, exciting new datasets. Advanced developers can join our development team: the statistical data creation is mainly made in the R language, and the service infrastructure in Python and Go components.

Analyze Locally, Act Globally: New regions R Package Release

Wed, 16 Jun 2021 12:00:00 +0000

The new version of our rOpenGov R package regions was released today on CRAN. This package is one of the engines of our experimental open data-as-service Green Deal Data Observatory, Economy Data Observatory, Digital Music Observatory prototypes, which aim to place open data packages into open-source applications.

In international comparison, the use of nationally aggregated indicators often has many disadvantages: they exhibit very different levels of homogeneity, and the data is often very limited in the number of observations for a cross-sectional analysis. When comparing European countries, a few missing cases can limit the cross-section of countries to around 20 cases, which disallows the use of many analytical methods. Working with sub-national statistics has many advantages: the similarity of the aggregation level and the high number of observations can allow more precise control of model parameters and errors, and the number of observations grows from 20 to 200-300.

The change from national to sub-national level comes with a huge data processing price: internal administrative boundaries, their names and codes change very frequently.

Yet the change from national to sub-national level comes with a huge data processing price. National boundaries are relatively stable, with only a handful of changes in each recent decade, because changing them requires a more-or-less global consensus. But states are free to change their internal administrative boundaries, and they do so frequently. This means that the names, identification codes and boundary definitions of sub-national regions change very frequently. Joining data from different sources and different years can be very difficult.
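
A minimal base R sketch of the recoding problem. It does not use the regions package. The two-row crosswalk is hand-made for illustration (two French NUTS 2 regions that, to our knowledge, were recoded between the NUTS 2013 and NUTS 2016 definitions) and should be checked against the official correspondence tables before any reuse; the helper recode_to_2016() and the data values are invented.

```r
# Hand-made, illustrative crosswalk: old code -> new code.
nuts_crosswalk <- data.frame(
  code_2013 = c("FR24", "FR26"),
  code_2016 = c("FRB0", "FRC1")
)

recode_to_2016 <- function(geo, crosswalk) {
  hit <- match(geo, crosswalk$code_2013)
  # Codes not found in the crosswalk are assumed unchanged.
  ifelse(is.na(hit), geo, crosswalk$code_2016[hit])
}

old_data <- data.frame(geo = c("FR24", "FR10", "FR26"),
                       value = c(1.2, 3.4, 5.6))
old_data$geo <- recode_to_2016(old_data$geo, nuts_crosswalk)
old_data$geo
```

A simple recode like this only covers regions that were renamed or relabelled. Regions that were split or merged need imputation or re-aggregation, which is where most of the real work lies.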

Our regions R package helps the data processing, validation and imputation of sub-national, regional datasets and their coding.

Switching from a national to a sub-national level of analysis has numerous advantages, but it comes with a huge price in data processing, validation and imputation, and the regions package aims to help with this process.

You can review the problem, and the code that created the two map comparisons, in the Mapping Regional Data, Mapping Metadata Problems vignette article of the package. A more detailed problem description can be found in Working With Regional, Sub-National Statistical Products.

This package is an offspring of the eurostat package on rOpenGov. It started as a tool to validate and re-code regional Eurostat statistics, but it aims to be a general solution for all sub-national statistics. It will be developed in parallel with other rOpenGov packages.

Get the Package

You can install the development version from GitHub with:

devtools::install_github("rOpenGov/regions")

or the released version from CRAN:

install.packages("regions")

You can review the complete package documentation on regions.dataobservatory.eu. If you find any problems with the code, please raise an issue on GitHub. Pull requests are welcome if you agree with the Contributor Code of Conduct.

If you use regions in your work, please cite the package as: Daniel Antal. (2021, June 16). regions (Version 0.1.7). CRAN. http://doi.org/10.5281/zenodo.4965909

Download the BibLaTeX entry.

Join us

Join our Green Deal Data Observatory collaboration!

Join our open collaboration Green Deal Data Observatory team as a data curator, developer or business developer. More interested in economic policies, particularly computational antitrust, innovation and small enterprises? Check out our Economy Data Observatory team! Or does your interest lie more in data governance, trustworthy AI and other digital market problems? Check out our Digital Music Observatory team!

Retrospective Survey Harmonization Case Study - Climate Awareness Change in Europe 2013-2019.

Fri, 05 Mar 2021 00:00:00 +0000

Retrospective survey harmonization comes with many challenges, as we have shown in the introduction to this tutorial case study. In this example, we will work with Eurobarometer’s data.

This code tutorial is not outdated, but the retroharmonize R package has a new (development) release with more features.

Please use the development version of retroharmonize:

devtools::install_github("antaldaniel/retroharmonize")

library(retroharmonize)
library(dplyr)       # this is necessary for the example 
library(lubridate)   # easier date conversion

## Warning: package 'lubridate' was built under R version 4.0.4

library(stringr)     # You can also use base R string processing functions

Get the Data

retroharmonize is not associated with Eurobarometer, or its creators, Kantar, or its archivists, GESIS. We assume that you have acquired the necessary files from GESIS after carefully reading their terms and that you placed them on a path that you call gesis_dir. The precise documentation of the data we use can be found in this supporting blogpost. To reproduce this blogpost, you will need ZA5877_v2-0-0.sav, ZA6595_v3-0-0.sav, ZA6861_v1-2-0.sav, ZA7488_v1-0-0.sav, ZA7572_v1-0-0.sav in a directory that you will name gesis_dir.

#Not run in the blogpost. In the repo we have a saved version.
climate_change_files <- c("ZA5877_v2-0-0.sav", "ZA6595_v3-0-0.sav",  "ZA6861_v1-2-0.sav", 
                          "ZA7488_v1-0-0.sav", "ZA7572_v1-0-0.sav")

eb_waves <- read_surveys(file.path(gesis_dir, climate_change_files), .f='read_spss')

if (dir.exists("data-raw")) {
  save ( eb_waves,  file = file.path("data-raw", "eb_climate_change_waves.rda") )
}

if ( file.exists( file.path("data-raw", "eb_climate_change_waves.rda") )) {
  load (file.path( "data-raw", "eb_climate_change_waves.rda" ) )
} else {
  load (file.path("..", "..",  "data-raw", "eb_climate_change_waves.rda") )
}

The eb_waves nested list contains five surveys imported from SPSS to the survey class of retroharmonize. The survey class is a data.frame that retains important metadata for further harmonization.

document_waves (eb_waves)

## # A tibble: 5 x 5
##   id            filename           ncol  nrow object_size
##   <chr>         <chr>             <int> <int>       <dbl>
## 1 ZA5877_v2-0-0 ZA5877_v2-0-0.sav   604 27919   139352456
## 2 ZA6595_v3-0-0 ZA6595_v3-0-0.sav   519 27718   119370440
## 3 ZA6861_v1-2-0 ZA6861_v1-2-0.sav   657 27901   151397528
## 4 ZA7488_v1-0-0 ZA7488_v1-0-0.sav   752 27339   169465928
## 5 ZA7572_v1-0-0 ZA7572_v1-0-0.sav   348 27655    80562432

Beware the object sizes. If you work with many surveys, memory-efficient programming becomes imperative. We will be subsetting whenever possible.

Metadata analysis

As noted before, prepare to work with nested lists. Each imported survey is nested as a data frame in the eb_waves list.

Metadata: Protocol Variables

Eurobarometer calls certain metadata elements, like the interviewee cooperation level or the date of a survey interview, protocol variables. Let’s start here. This will be our template to harmonize more and more aspects of the five surveys (which are, in fact, already harmonizations of about 30 surveys conducted in a single ‘wave’ in multiple countries).

# select variables of interest from the metadata
eb_protocol_metadata <- eb_climate_metadata %>%
  filter ( .data$label_orig %in% c("date of interview") |
             .data$var_name_orig == "rowid")  %>%
  suggest_var_names( survey_program = "eurobarometer" )

# subset and harmonize these variables in all nested list items of 'waves' of surveys
interview_dates <- harmonize_var_names(eb_waves, 
                                       eb_protocol_metadata )

# apply similar data processing rules to same variables
interview_dates <- lapply (interview_dates, 
                      function (x) x %>% mutate ( date_of_interview = as_character(.data$date_of_interview) )
                      )

# join the individual survey tables into a single table 
interview_dates <- as_tibble ( Reduce (rbind, interview_dates) )

# Check the variable classes.

vapply(interview_dates, function(x) class(x)[1], character(1))

##             rowid date_of_interview 
##       "character"       "character"

This is our sample workflow for each block of variables.

Get a unique identifier.
Add other variables.
Harmonize the variable names.
Subset the data, leaving out anything that you do not harmonize in this block.
Apply some normalization in a nested list.
When the variables are harmonized to the same name and class, merge them into a data.frame-like tibble object.
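
The steps above can be sketched with base R only. This is a toy illustration of the workflow, not the retroharmonize code: the two 'waves', their variable names and the name_map lookup are all invented.

```r
# Two toy waves with different variable names and an unneeded extra column.
waves <- list(
  wave_1 = data.frame(id = 1:2, p1 = c(20210301, 20210302), extra = c("a", "b")),
  wave_2 = data.frame(uniqid = 1:2, date = c("20210401", "20210402"))
)

# Hypothetical lookup: original name -> harmonized name, per wave.
name_map <- list(
  wave_1 = c(id = "rowid", p1 = "date_of_interview"),
  wave_2 = c(uniqid = "rowid", date = "date_of_interview")
)

harmonized <- Map(function(df, map, survey_id) {
  df <- df[, names(map), drop = FALSE]          # keep the ID and chosen variables only
  names(df) <- unname(map[names(df)])           # harmonize the variable names
  df$rowid <- paste0(survey_id, "_", df$rowid)  # make the ID unique across waves
  df$date_of_interview <- as.character(df$date_of_interview)  # same class everywhere
  df
}, waves, name_map, names(waves))

# Only now can the nested list be merged into a single table.
interview_dates <- do.call(rbind, harmonized)
rownames(interview_dates) <- NULL
interview_dates
```

The merge comes last on purpose: as long as names and classes differ, the surveys are easier to handle as separate list elements.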

Now finish the harmonization. Wednesday, 31st October 2018 should become a Date type 2018-10-31.

library(lubridate)
harmonize_date <- function(x) {
  x <- tolower(as.character(x))
  # remove weekday names and commas
  x <- gsub("monday|tuesday|wednesday|thursday|friday|saturday|sunday|\\,", "", x)
  # remove ordinal suffixes only after a digit, so that "august" stays intact
  x <- gsub("(\\d)(st|nd|rd|th)", "\\1", x)
  x <- gsub("decemberber", "december", x) # all those annoying real-life data problems!
  x <- stringr::str_trim (x, "both")
  x <- gsub("^0", "", x )
  x <- gsub("\\s\\s", " ", x)  # collapse double spaces into a single space
  lubridate::dmy(x) 
}

interview_dates <- interview_dates %>%
  mutate ( date_of_interview = harmonize_date(.data$date_of_interview) )

vapply(interview_dates, function(x) class(x)[1], character(1))

##             rowid date_of_interview 
##       "character"            "Date"

To avoid duplicated row IDs, which may not be unique across different surveys, we created a simple, sequential ID for each survey that includes the ID of the original file.

set.seed(2021)
sample_n(interview_dates, 6)

## # A tibble: 6 x 2
##   rowid               date_of_interview
##   <chr>               <date>           
## 1 ZA7488_v1-0-0_7016  2018-10-28       
## 2 ZA7488_v1-0-0_19187 2018-11-02       
## 3 ZA6861_v1-2-0_1218  2017-03-18       
## 4 ZA6861_v1-2-0_4142  2017-03-21       
## 5 ZA7572_v1-0-0_12363 2019-04-17       
## 6 ZA7572_v1-0-0_8071  2019-04-18
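The identifiers in this sample appear to combine the file name stem with a running number. A minimal sketch of that pattern, assuming nothing about how retroharmonize builds the IDs internally (make_rowid() is a hypothetical helper):

```r
# Hypothetical helper: build survey-specific row IDs from a file identifier.
make_rowid <- function(n_rows, file_id) {
  paste0(file_id, "_", seq_len(n_rows))
}

ids <- c(make_rowid(3, "ZA7488_v1-0-0"), make_rowid(3, "ZA6861_v1-2-0"))
ids
anyDuplicated(ids) == 0   # TRUE: no ID occurs twice across the two surveys
```

Because the file identifier is part of every ID, the same running number in two different files can never collide.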

After this type-conversion problem, let’s look at a case where an original SPSS variable can have two meaningful R representations.

Metadata: Geographical information

Let’s continue with harmonizing geographical information in the files. In this example, var_name_suggested will contain the harmonized variable name. You will likely have to make this call yourself, after carefully reading the original questionnaires and codebooks.

eb_regional_metadata <- eb_climate_metadata %>%
  filter ( grepl( "rowid|isocntry|^nuts$", .data$var_name_orig)) %>%
  suggest_var_names( survey_program = "eurobarometer" ) %>%
  mutate ( var_name_suggested = case_when ( 
    var_name_suggested == "region_nuts_codes"     ~ "geo",
    TRUE ~ var_name_suggested ))

The harmonize_var_names() function takes all variables in the subsetted geographical metadata table and renames them to the harmonized var_name_suggested name. It also subsets the surveys, so that no non-harmonized variables remain. All regional NUTS codes become geo in our case:

geography <- harmonize_var_names(eb_waves, 
                                 eb_regional_metadata)

If you are used to working with single survey files, you probably work in a tabular format, which easily converts into a data.frame-like object, in our example, tidyverse’s tibble. However, when working with longitudinal data, it is far simpler to work with nested lists, because the tables usually have different dimensions: neither the rows (the observations) nor the columns are the same across all survey files.

In the nested list, each list element is a single, tabular-format survey. (In fact, the surveys are stored in retroharmonize’s survey class, which is a rich tibble that contains the metadata and the processing history of the survey.)

The regional information in the Eurobarometer files is contained in the nuts variable. We want to keep both the original labels and values. The original values are the region’s codes, and the labels are the names. The easiest and fastest solution is the base R lapply loop.

geography <- lapply ( geography, 
                      function (x) x %>% mutate ( region = as_character(geo), 
                                                  geo    = as.character(geo) )  
)
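To make the two representations concrete without loading any package, here is a toy stand-in for a labelled variable: a vector of codes plus a code-to-label lookup. The region names are real NUTS labels, but the structure is only an assumption for illustration; the real labelled class stores far more metadata.

```r
# Toy stand-in for a labelled variable: codes plus a code-to-label lookup.
nuts_labels <- c("SI012" = "Podravska", "PL63" = "Pomorskie", "DK02" = "Sjaelland")
geo <- c("PL63", "SI012", "PL63", "DK02")

geography_toy <- data.frame(
  geo    = geo,                        # the code, the analogue of as.character()
  region = unname(nuts_labels[geo]),   # the label, the analogue of as_character()
  stringsAsFactors = FALSE
)
geography_toy
```

Keeping both columns means the code stays available for joins and maps, while the label stays available for human-readable tables.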

Because each table has exactly the same columns, we can simply use rbind() and reduce the list to a modern data.frame, i.e. a tibble.

geography <- as_tibble ( Reduce (rbind, geography) )
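Reducing with rbind() only works when every table has the same columns, so it can be worth checking that assumption explicitly first. A small base R sketch with made-up tables:

```r
# Toy list of tables that share exactly the same columns.
tables <- list(
  data.frame(rowid = c("a_1", "a_2"), geo = c("SI012", "PL63")),
  data.frame(rowid = c("b_1"),        geo = c("DK02"))
)

# Check the assumption before binding: every table has the same column names.
same_columns <- all(vapply(
  tables,
  function(x) identical(names(x), names(tables[[1]])),
  logical(1)
))
stopifnot(same_columns)

combined <- Reduce(rbind, tables)
nrow(combined)   # 3
```

If the check fails, a join-based reduction is the safer choice, because rbind() stops with an error on mismatched column names.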

Let’s see a dozen cases:

set.seed(2021)
sample_n(geography, 12)

## # A tibble: 12 x 4
##    rowid               isocntry geo   region              
##    <chr>               <chr>    <chr> <chr>               
##  1 ZA7488_v1-0-0_7016  SI       SI012 Podravska           
##  2 ZA7488_v1-0-0_19187 PL       PL63  Pomorskie           
##  3 ZA6861_v1-2-0_1218  DK       DK02  Sjaelland           
##  4 ZA6861_v1-2-0_4142  FI       FI1B  Helsinki-Uusimaa    
##  5 ZA7572_v1-0-0_12363 SE       SE12  Oestra Mellansverige
##  6 ZA7572_v1-0-0_8071  IT       ITH   Nord-Est [IT]       
##  7 ZA6861_v1-2-0_6145  IE       IE021 Dublin              
##  8 ZA6861_v1-2-0_24638 RO       RO31  South [RO]          
##  9 ZA7488_v1-0-0_11315 CY       CY    REPUBLIC OF CYPRUS  
## 10 ZA6595_v3-0-0_27568 HR       HR041 Grad Zagreb         
## 11 ZA7572_v1-0-0_17397 CZ       CZ06  Jihovychod          
## 12 ZA6861_v1-2-0_10993 PT       PT17  Lisboa

The idea is that we do similar variable harmonization block by block, and eventually we will join them together. Next step: socio-demography and weights.

Socio-demography and Weights

There are a few peculiar issues to look out for. This example shows that survey harmonization requires plenty of expert judgment, and you cannot fully automate the process.

The Eurobarometer archives do not use all weight and demographic variable names consistently. For example, the wex variable, which is a projected weight for the country’s population aged 15 or older, is sometimes called wex and sometimes wextra. The individual survey’s post-stratification weight is the w1 variable, but this is not necessarily the one you need to use.

The suggest_var_names() function has a survey_program = "eurobarometer" parameter, which normalizes the most frequently used variable names. For example, all variations of wex and wextra will be normalized to wex. You can ignore this parameter and use your own names, too.

eb_demography_metadata  <- eb_climate_metadata %>%
  filter ( grepl( "rowid|isocntry|^d8$|^d7$|^wex|^w1$|d25|^d15a|^d11$", .data$var_name_orig) ) %>%
  suggest_var_names( survey_program = "eurobarometer")

As you can see, using the original labels would not help, because they also come in several variations.

eb_demography_metadata %>%
  select ( filename, var_name_orig, label_orig, var_name_suggested ) %>%
  filter (var_name_orig %in% c("wex", "wextra") )

##            filename var_name_orig                                  label_orig
## 1 ZA5877_v2-0-0.sav        wextra      weight extrapolated population 15 plus
## 2 ZA6595_v3-0-0.sav        wextra      weight extrapolated population 15 plus
## 3 ZA6861_v1-2-0.sav           wex weight extrapolated population aged 15 plus
## 4 ZA7488_v1-0-0.sav           wex weight extrapolated population aged 15 plus
## 5 ZA7572_v1-0-0.sav           wex weight extrapolated population aged 15 plus
##   var_name_suggested
## 1                wex
## 2                wex
## 3                wex
## 4                wex
## 5                wex

demography <- harmonize_var_names ( waves = eb_waves, 
                                    metadata = eb_demography_metadata )
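The name normalization itself is conceptually a lookup from known variants to one canonical name. A minimal sketch, in which normalize_var_names() and its variants argument are hypothetical and only the wextra/wex pair comes from the data above:

```r
# Hypothetical lookup of name variants; only the wex/wextra pair comes from the text.
normalize_var_names <- function(var_names, variants = c(wextra = "wex")) {
  hit <- var_names %in% names(variants)
  var_names[hit] <- unname(variants[var_names[hit]])
  var_names
}

normalize_var_names(c("rowid", "wextra", "w1", "wex"))
```

Names that are not listed as a variant pass through unchanged, so the lookup can grow one survey program at a time.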

Socio-demographic variables like the highest level of education or occupation are rather country-specific. Eurobarometer uses standardized occupation and marital status scales and, as a proxy for education level, the age of leaving full-time education.

This is a particularly tricky variable, because its coding in fact contains three different variables: the school-leaving age, a marker for people who are still studying, and a marker for people who did not finish their compulsory primary school. And while school-leaving age has been a good proxy since the 1970s, it becomes less and less useful in an age when the EU promotes lifelong learning and people stop and restart their education throughout their lives.

example <- demography[[1]] %>%
  mutate ( across ( -any_of(c("rowid", "w1", "wex")), as_character) ) %>%
  mutate ( across (any_of(c("w1", "wex")), as_numeric) )
unique ( example$age_education )

##  [1] "22"                     "25"                     "17"                    
##  [4] "19"                     "12"                     "23"                    
##  [7] "18"                     "20"                     "21"                    
## [10] "14"                     "24"                     "16"                    
## [13] "26"                     "15"                     "Still studying"        
## [16] "DK"                     "31"                     "29"                    
## [19] "27"                     "13"                     "32"                    
## [22] "28"                     "30"                     "53"                    
## [25] "42"                     "62"                     "40"                    
## [28] "No full-time education" "Refusal"                "37"                    
## [31] "39"                     "34"                     "35"                    
## [34] "47"                     "36"                     "45"                    
## [37] "51"                     "33"                     "43"                    
## [40] "38"                     "49"                     "46"                    
## [43] "41"                     "57"                     "7"                     
## [46] "48"                     "44"                     "50"                    
## [49] "56"                     "8"                      "11"                    
## [52] "10"                     "9"                      "75 years"              
## [55] "6"                      "3"                      "54"                    
## [58] "55"                     "60"                     "64"                    
## [61] "2 years"                "58"                     "52"                    
## [64] "72"                     "61"                     "4"                     
## [67] "63"

The seemingly trivial age_exact variable has its own issues, too:

unique ( example$age_exact)

##  [1] "54"       "66"       "56"       "53"       "33"       "72"      
##  [7] "83"       "62"       "86"       "77"       "64"       "46"      
## [13] "44"       "59"       "60"       "67"       "63"       "20"      
## [19] "43"       "37"       "78"       "49"       "90"       "45"      
## [25] "28"       "29"       "30"       "39"       "51"       "38"      
## [31] "41"       "71"       "25"       "48"       "79"       "88"      
## [37] "61"       "85"       "70"       "35"       "81"       "52"      
## [43] "57"       "27"       "47"       "15 years" "21"       "42"      
## [49] "32"       "68"       "36"       "34"       "19"       "31"      
## [55] "26"       "23"       "24"       "22"       "16"       "84"      
## [61] "65"       "18"       "55"       "40"       "50"       "73"      
## [67] "69"       "87"       "89"       "74"       "75"       "98 years"
## [73] "76"       "80"       "58"       "82"       "17"       "93"      
## [79] "91"       "92"       "95"       "94"       "97"

Let’s see all the strange labels attached to age-type variables:

collect_val_labels(metadata = eb_demography_metadata %>%
                     filter ( var_name_suggested %in% c("age_exact", "age_education")) )

##  [1] "2 years"                  "75 years"                
##  [3] "No full-time education"   "Still studying"          
##  [5] "15 years"                 "98 years"                
##  [7] "96 years"                 "[NOT CLEARLY DOCUMENTED]"
##  [9] "74 years"                 "99 and older"            
## [11] "Refusal"                  "87 years"                
## [13] "DK"                       "88 years"

We must handle many exceptions, so we created a function for this purpose:

remove_years  <- function(x) { 
  x <- gsub("years|and\\solder", "", tolower(x))
  stringr::str_trim (x, "both")}

process_demography <- function (x) { 
  
  x %>% mutate ( across ( -any_of(c("rowid", "w1", "wex")), as_character) ) %>%
    mutate ( across (any_of(c("w1", "wex")), as_numeric) ) %>%
    mutate ( across (contains("age"), remove_years)) %>%
    mutate ( age_exact = as.numeric (age_exact)) %>%
    mutate ( is_student = ifelse ( tolower(age_education) == "still studying", 
                                   1, 0), 
             no_education = ifelse ( tolower(age_education) == "no full-time education", 1, 0)) %>%
    mutate ( education = case_when (
      grepl("studying", age_education) ~ age_exact, 
      grepl ("education", age_education)  ~ 14, 
      grepl ("refus|document|dk", tolower(age_education)) ~ NA_real_,
      TRUE ~ as.numeric(age_education)
    ))  %>%
    mutate ( education = case_when ( 
      education < 14 ~ NA_real_, 
      education > 30 ~ 30, 
      TRUE ~ education )) 
}

demography <- lapply ( demography, process_demography )

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion
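The logic of process_demography() may be easier to follow on a handful of values. The sketch below re-implements the education part in base R on toy input. The helper split_age_education() is hypothetical and only mirrors the rules used above: students get their current age, "no full-time education" is coded as 14, values below 14 become missing, and values above 30 are capped at 30.

```r
# Hypothetical base-R mirror of the education rules in process_demography().
split_age_education <- function(age_education, age_exact) {
  raw     <- tolower(trimws(as.character(age_education)))
  cleaned <- trimws(gsub("years|and\\solder", "", raw))
  is_student   <- as.numeric(raw == "still studying")
  no_education <- as.numeric(raw == "no full-time education")
  # non-numeric labels (refusal, dk, ...) intentionally become NA here
  years <- suppressWarnings(as.numeric(cleaned))
  education <- ifelse(is_student == 1, age_exact,
                      ifelse(no_education == 1, 14, years))
  education[!is.na(education) & education < 14] <- NA_real_
  education[!is.na(education) & education > 30] <- 30
  data.frame(is_student, no_education, education)
}

toy <- split_age_education(
  age_education = c("22", "Still studying", "No full-time education",
                    "Refusal", "2 years", "75 years"),
  age_exact     = c(40, 20, 84, 50, 30, 80)
)
toy
```

Wrapping the coercion in suppressWarnings() silences the "NAs introduced by coercion" messages, which is reasonable only because the missing values are expected for text labels such as Refusal.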

# We use full_join() rather than rbind(), because different waves contain different variables.
demography <- Reduce ( full_join, demography )

## Joining, by = c("rowid", "isocntry", "w1", "wex", "marital_status", "age_education", "age_exact", "occupation_of_respondent", "occupation_of_respondent_recoded", "respondent_occupation_scale_c_14", "type_of_community", "is_student", "no_education", "education")
## Joining, by = c("rowid", "isocntry", "w1", "wex", "marital_status", "age_education", "age_exact", "occupation_of_respondent", "occupation_of_respondent_recoded", "respondent_occupation_scale_c_14", "type_of_community", "is_student", "no_education", "education")
## Joining, by = c("rowid", "isocntry", "w1", "wex", "marital_status", "age_education", "age_exact", "occupation_of_respondent", "occupation_of_respondent_recoded", "respondent_occupation_scale_c_14", "type_of_community", "is_student", "no_education", "education")
## Joining, by = c("rowid", "isocntry", "w1", "wex", "marital_status", "age_education", "age_exact", "occupation_of_respondent", "occupation_of_respondent_recoded", "respondent_occupation_scale_c_14", "type_of_community", "is_student", "no_education", "education")
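For readers who want to see what a full join does when the columns differ, here is a base R analogue that uses merge() with all = TRUE instead of dplyr's full_join(). The two toy waves are made up for illustration.

```r
wave_1 <- data.frame(rowid = c("a_1", "a_2"), isocntry = c("SI", "PL"), w1  = c(0.8, 1.0))
wave_2 <- data.frame(rowid = c("b_1", "b_2"), isocntry = c("DK", "FI"), wex = c(3100, 8601))

# merge() joins on all shared column names by default;
# all = TRUE keeps every row from both sides, i.e. a full join.
full_merge <- function(x, y) merge(x, y, all = TRUE)
toy_panel  <- Reduce(full_merge, list(wave_1, wave_2))

dim(toy_panel)   # 4 rows, 4 columns
```

The weight that a wave does not contain is filled with NA, which is exactly why rbind() would not work here.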

Now let’s see what we have here:

set.seed(2021)
sample_n(demography, 12)

## # A tibble: 12 x 14
##    rowid    isocntry    w1    wex marital_status        age_education  age_exact
##    <chr>    <chr>    <dbl>  <dbl> <chr>                 <chr>              <dbl>
##  1 ZA7488_~ SI       0.828  1428. (Re-)Married: withou~ 19                    43
##  2 ZA7488_~ PL       1.01  32830. (Re-)Married: withou~ 19                    64
##  3 ZA6861_~ DK       0.641  3100. (Re-)Married: withou~ 22                    78
##  4 ZA6861_~ FI       1.83   8601. (Re-)Married: childr~ 30                    38
##  5 ZA7572_~ SE       0.342  2645. (Re-)Married: withou~ 17                    68
##  6 ZA7572_~ IT       0.630 32287. (Re-)Married: childr~ 20                    40
##  7 ZA6861_~ IE       0.868  3054. (Re-)Married: childr~ 32                    42
##  8 ZA6861_~ RO       0.724 11805. (Re-)Married: withou~ 14                    59
##  9 ZA7488_~ CY       0.691  1013. (Re-)Married: childr~ 18                    67
## 10 ZA6595_~ HR       0.580  2098. Single living w part~ 27                    30
## 11 ZA7572_~ CZ       1.86  16908. Single: without chil~ still studying        20
## 12 ZA6861_~ PT       0.932  7448. Widow: with children  no full-time ~        84
## # ... with 7 more variables: occupation_of_respondent <chr>,
## #   occupation_of_respondent_recoded <chr>,
## #   respondent_occupation_scale_c_14 <chr>, type_of_community <chr>,
## #   is_student <dbl>, no_education <dbl>, education <dbl>

Harmonizing Variable Labels

So far we have been working with metadata, weights and socio-demography. In other words, we have not even started the desired harmonization of climate change awareness. The methodology is the same, but here we really must look out for the answer options in the questionnaire. (Refer to our data summary again here.)

climate_awareness_metadata <- eb_climate_metadata %>%
  suggest_var_names( survey_program = "eurobarometer" ) %>%
  filter ( .data$var_name_suggested  %in% c("rowid",
                                            "serious_world_problems_first", 
                                             "serious_world_problems_climate_change")
  ) 

hw <- harmonize_var_names ( waves = eb_waves, 
                            metadata = climate_awareness_metadata )

The retroharmonize package comes with a generic harmonize_values() function that changes the value labels of categorical variables (including binary ones) to a uniform format. It also takes care of various types of missing values.

First, let’s go back to our metadata and collect all value labels that will show up with collect_val_labels():

collect_val_labels(climate_awareness_metadata)

##  [1] "Climate change"                            
##  [2] "International terrorism"                   
##  [3] "Poverty, hunger and lack of drinking water"
##  [4] "Spread of infectious diseases"             
##  [5] "The economic situation"                    
##  [6] "Proliferation of nuclear weapons"          
##  [7] "Armed conflicts"                           
##  [8] "The increasing global population"          
##  [9] "Other (SPONTANEOUS)"                       
## [10] "None (SPONTANEOUS)"                        
## [11] "Not mentioned"                             
## [12] "Mentioned"                                 
## [13] "DK"

In this case, we want to select Climate change as the mentioned most serious problem, and Climate change taken from a list of three serious problems. The first question type is a single-choice one, where Climate change is either mentioned, or the alternative answer is labeled as Not mentioned. In the multiple choice case, the alternative may be something else, for example, Spread of infectious diseases, as we all well know by 2021.

We want to see who thought Climate change was the most serious problem, or one of the most serious problems, so we label each mention of Climate change as mentioned and pair it with a numeric value of 1. All other cases are labeled as not_mentioned, with the exception of the various missing observations, which in these surveys are Do not know answers, Declined to answer cases, and Inappropriate cases. (The last one is Eurobarometer’s label for questions that were, for one reason or another, not asked of a particular interviewee, for example, because the Turkish Cypriot community received a different questionnaire.)

# positive cases
label_1 <- c("^Climate\\schange", "^Mentioned")
# missing cases 
na_labels <- collect_na_labels( climate_awareness_metadata)
na_labels

## [1] "DK"                             "Inap. (10 or 11 in qa1a)"      
## [3] "Inap. (coded 10 or 11 in qc1a)" "Inap. (coded 10 or 11 in qb1a)"

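# negative cases
label_0 <- collect_val_labels( climate_awareness_metadata)
# label_1 holds regular expressions, so match with grepl() rather than %in%
label_0 <- label_0[! grepl( paste(label_1, collapse = "|"), label_0) ]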
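Stripped of all metadata handling, the recoding rule is: 1 for a positive label, NA for a missing-type label, and 0 for everything else. The base R sketch below shows only this rule. The function recode_climate_mention() and its default patterns are hypothetical, and it deliberately discards the distinction between the different kinds of missingness that the retroharmonize classes keep.

```r
# Hypothetical recoding helper; the patterns are regular expressions, hence grepl().
recode_climate_mention <- function(x,
                                   positive = c("^Climate\\schange", "^Mentioned"),
                                   missing  = c("^DK$", "^Inap")) {
  out <- rep(0, length(x))
  out[grepl(paste(positive, collapse = "|"), x)] <- 1
  out[grepl(paste(missing,  collapse = "|"), x) | is.na(x)] <- NA_real_
  out
}

answers <- c("Climate change", "International terrorism", "Mentioned",
             "Not mentioned", "DK", "Inap. (10 or 11 in qa1a)")
recode_climate_mention(answers)
```

Note that the anchored pattern ^Mentioned does not match Not mentioned, which is why the anchors matter here.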

The harmonize_serious_problems() function harmonizes the labels within the special labeled class of retroharmonize. This class retains all information to give categorical variables a character or numeric representation, and various processing metadata for documentation purposes. While this class is very rich (it contains whatever was imported from SPSS’s proprietary data format and the history), it is not suitable for statistical analysis. We could, of course, directly call the harmonize_values() from the retroharmonize package, but the parameterization would be very complicated even in a simple function call, not to mention a looped call. Because this function is the heart of the retroharmonize package, it has a tutorial article of its own.

harmonize_serious_problems <- function(x) {
  label_list <- list(
    from = c(label_0, label_1, na_labels), 
    to = c( rep ( "not_mentioned", length(label_0) ),   # use the same order as in from!
            rep ( "mentioned", length(label_1) ),
            "do_not_know", "inap", "inap", "inap"), 
    numeric_values = c(rep ( 0, length(label_0) ), # use the same order as in from!
                       rep ( 1, length(label_1) ),
                       99997,99999,99999,99999)
  )
  
  harmonize_values(x, 
                   harmonize_labels = label_list, 
                   na_values = c("do_not_know"=99997,
                                 "declined"=99998,
                                 "inap"=99999), 
                   remove = "\\(|\\)|\\[|\\]|\\%"
  )
}

Our objects are rather big in memory, so first, let’s remove the surveys that do not contain these world problem variables. In these cases, the subsetted and harmonized surveys in the nested list have only one column, i.e. the rowid.

hw <- hw[unlist ( lapply ( hw, ncol)) > 1 ]
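The same filter can be written with vapply(), which guarantees an integer result for every list element. A toy illustration with made-up surveys:

```r
surveys <- list(
  wave_a = data.frame(rowid = c("a_1", "a_2")),                     # only the ID survived
  wave_b = data.frame(rowid = c("b_1", "b_2"), problem = c(0, 1))   # has a harmonized item
)

# Keep only the surveys that have at least one column besides rowid.
has_items <- vapply(surveys, ncol, integer(1)) > 1
surveys   <- surveys[has_items]
names(surveys)   # "wave_b"
```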

Now we have a smaller problem to deal with. With many surveys, it is easy to fill up your computer’s memory, so let’s start building up our joined panel data from a smaller set of nested, subsetted surveys.

hw <- lapply ( hw, function (x) x %>% mutate ( across ( contains("problem"), harmonize_serious_problems) ) )

Our lapply loop calls an anonymous function, which in turn calls harmonize_serious_problems(), our parameterized version of harmonize_values(), on all variables that have problem in their names.

Once we are done, our variables have harmonized names, harmonized values, and harmonized labels, but they are still stored in the complex retroharmonize_labelled_spss_survey class, which inherits from haven_labelled_spss in haven.

We reduced our single and multiple choice questions to binary choice variables. We can now give them a numeric representation. Be mindful that retroharmonize has special methods for its special labeled class that retains metadata from SPSS. This means that as_character and as_numeric know how to handle various types of missing values, whereas the base R as.character and as.numeric may coerce special values to unwanted results. This is particularly dangerous with numeric variables, and this is the reason why we introduced a new set of S3 objects and methods in the package.

We will ignore the differences between various forms of missingness, i.e. the person said that she did not know, or did not want to answer, or for some reason was not asked in the survey. In a more descriptive, non-harmonized analysis you would probably want to explore them as various ‘categories’ and use a character representation.

hw <- lapply ( hw, function(x) x %>% mutate ( across ( contains("problem"), as_numeric) ))

hw <- Reduce ( full_join, hw) # we must use joins instead of binds because the number of columns varies.
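The danger of treating special missing-value codes as ordinary numbers can be shown with a few toy values. The codes below are the ones used in harmonize_serious_problems(); the rest is a base R illustration of the principle, not the package's implementation.

```r
# Toy numeric codes: 0/1 are answers, the 9999x values mark missing answers.
x <- c(0, 1, 1, 0, 99997, 99999)
na_codes <- c(do_not_know = 99997, declined = 99998, inap = 99999)

mean(x)                                   # badly distorted by the special codes

x_clean <- ifelse(x %in% na_codes, NA_real_, x)
mean(x_clean, na.rm = TRUE)               # 0.5, the share of "mentioned" among valid answers
```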

Let’s see what we have:

set.seed(2021)
sample_n (hw, 12)

## # A tibble: 12 x 3
##    rowid             serious_world_problems_fi~ serious_world_problems_climate_~
##    <chr>                                  <dbl>                            <dbl>
##  1 ZA6595_v3-0-0_23~                          0                               NA
##  2 ZA7572_v1-0-0_70~                          0                                0
##  3 ZA6595_v3-0-0_18~                          0                               NA
##  4 ZA6861_v1-2-0_27~                          0                                0
##  5 ZA6595_v3-0-0_26~                          0                               NA
##  6 ZA7572_v1-0-0_19~                          0                                1
##  7 ZA5877_v2-0-0_16~                          0                                0
##  8 ZA6861_v1-2-0_12~                          0                                0
##  9 ZA7572_v1-0-0_17~                          0                                0
## 10 ZA5877_v2-0-0_17~                          0                                1
## 11 ZA6861_v1-2-0_41~                          0                                0
## 12 ZA6861_v1-2-0_61~                          0                                1

Creating the Longitudinal Table

Now we just need to join the partial tables together by rowid:

# start from the smallest (we removed the survey that had no relevant questionnaire item)
panel <- hw %>%
  left_join ( geography, by = 'rowid' ) 

panel <- panel %>%
  left_join ( demography, by = c("rowid", "isocntry") ) 

panel <- panel %>%
  left_join ( interview_dates, by = 'rowid' )
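A left join only preserves the number of rows if the key is unique in the table being joined in. A small, hedged sketch of such a sanity check in base R, with made-up data and a hypothetical check_unique_key() helper:

```r
# Hypothetical helper: TRUE if the key column contains no duplicates.
check_unique_key <- function(df, key = "rowid") {
  anyDuplicated(df[[key]]) == 0
}

left  <- data.frame(rowid = c("a_1", "a_2", "b_1"), problem = c(0, 1, 0))
right <- data.frame(rowid = c("a_1", "a_2", "b_1"), geo = c("SI012", "PL63", "DK02"))

stopifnot(check_unique_key(left), check_unique_key(right))

# merge(..., all.x = TRUE) is the base R analogue of a left join.
joined <- merge(left, right, by = "rowid", all.x = TRUE)
nrow(joined) == nrow(left)   # TRUE: no rows were duplicated or lost
```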

And let’s see a small sample:

sample_n(panel, 12)

## # A tibble: 12 x 19
##    rowid  serious_world_pr~ serious_world_pr~ isocntry geo   region    w1    wex
##    <chr>              <dbl>             <dbl> <chr>    <chr> <chr>  <dbl>  <dbl>
##  1 ZA686~                 0                 0 ES       ES41  Casti~ 1.21  46787.
##  2 ZA686~                 0                 0 RO       RO31  South~ 0.724 11805.
##  3 ZA686~                 0                 0 SK       SK02  Zapad~ 0.774  3499.
##  4 ZA757~                 0                 1 PT       PT16  Centr~ 1.11   9336.
##  5 ZA659~                 1                NA HR       HR041 Grad ~ 0.580  2098.
##  6 ZA659~                 1                NA RO       RO21  North~ 1.21  20160.
##  7 ZA686~                 0                 0 PT       PT17  Lisboa 0.932  7448.
##  8 ZA659~                 0                NA GB-GBN   UKI   London 0.994 50133.
##  9 ZA757~                 0                 0 CY       CY    REPUB~ 0.594   874.
## 10 ZA686~                 0                 0 LT       LT003 Klaip~ 0.623  1564.
## 11 ZA757~                 0                 0 IE       IE013 West ~ 0.490  1651.
## 12 ZA659~                 0                NA LT       LT003 Klaip~ 1.16   2917.
## # ... with 11 more variables: marital_status <chr>, age_education <chr>,
## #   age_exact <dbl>, occupation_of_respondent <chr>,
## #   occupation_of_respondent_recoded <chr>,
## #   respondent_occupation_scale_c_14 <chr>, type_of_community <chr>,
## #   is_student <dbl>, no_education <dbl>, education <dbl>,
## #   date_of_interview <date>

saveRDS ( panel, file.path(tempdir(), "climate_panel.rds"), version = 2)

# not evaluated
saveRDS( panel, file = file.path("data-raw", "climate-panel.rds"), version = 2)
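To confirm that a saved file can be restored, a round trip through readRDS() is a cheap check. The toy table and file name below are placeholders.

```r
toy_panel <- data.frame(rowid = c("a_1", "a_2"), education = c(19, 22))
path <- file.path(tempdir(), "toy_panel.rds")

# version = 2 writes the older serialization format, which older R releases can read too.
saveRDS(toy_panel, file = path, version = 2)
restored <- readRDS(path)

isTRUE(all.equal(toy_panel, restored))   # TRUE
```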

Putting It on a Map

This is not the end of the story. If you put all this on a map, the results are a bit disappointing.

Why? Because sub-national (provincial, state, county, district, parish) borders change all the time, within the EU and everywhere else. The next step is to harmonize the geographical information. We have released another package on CRAN to help you with this. See the next post: Regional Climate Change Awareness Dataset.

What is Retrospective Survey Harmonization?

Thu, 04 Mar 2021 00:00:00 +0000

Reproducible ex post harmonization of survey microdata

Retrospective survey harmonization allows the comparison of opinion poll data collected in different countries or at different times. In this example we are working with data from surveys that were ex ante harmonized to a certain degree: in our tutorials we are choosing questions that were asked in the same way in many natural languages. For example, you can compare what percentage of Europeans in various countries, provinces and regions thought climate change was a serious world problem back in 2013, 2015, 2017 and 2019.

We developed the retroharmonize R package to help with this process. We have tested the package extensively with about 80 Eurobarometer and 5 Afrobarometer survey files, and to a lesser extent with Arab Barometer files. This allows the comparison of various survey answers in about 70 countries. These policy-oriented survey programs were designed to be harmonized to a certain degree, but their ex post harmonization is still necessary, challenging and error-prone. Retrospective harmonization includes harmonizing the different codings used for questions and answer options, the post-stratification weights, and the different file formats.

Eurobarometer, Afrobarometer, Arab Barometer and Latinobarómetro make survey files that are harmonized across countries available for research under various terms. The retroharmonize package is not affiliated with them, and to run our examples, you must visit their websites, carefully read their terms, agree to them, and download their data yourself. The value we add is that we help to connect their files across time (from different years) or across these programs.

The survey programs mentioned above publish their data in the proprietary SPSS format. This file format can be imported and translated to R objects with the haven package; however, we needed to re-design haven’s labelled_spss class to maintain far more metadata. That class is, in turn, a modification of the labelled class. The haven package was designed and tested with data stored in individual SPSS files.

Joseph Larmarange, the author of labelled, describes two main approaches to working with labelled data, such as SPSS’s method of storing categorical data, in the Introduction to labelled.

Two main approaches of labelled data conversion.

Our approach is a further extension of Approach B. Survey harmonization in our case always means joining data from several SPSS files, which requires a consistent coding among several data sources. This means that data cleaning and recoding must take place before conversion to factors, character or numeric vectors. This is particularly important with factor data (and their simple character conversions) and numeric data that occasionally contains labels, for example, to describe the reason why certain data is missing. Our tutorial vignette labelled_spss_survey gives you more information about this.

In the next series of tutorials, we will deal with an array of problems. These are not for the faint of heart: you need a solid intermediate level of R to follow.

Tidy, joined survey data

The original file identifiers may not be unique, so we have to create new, truly unique identifiers. Weighting may not be straightforward.
Neither the number of observations nor the number of variables (which represent the survey questions and their translation to coded data) is the same. Certain data may be present in only one survey and not in the other. This means that you will likely run loops on lists and not on data.frames, but eventually you must carefully join them.

Class conversion

Similar questions may be imported from a non-native R format, in our case, from SPSS files, in an inconsistent manner. SPSS’s variable formats cannot be translated unambiguously to R classes. retroharmonize introduced a new S3 class system that handles this problem, but eventually you will have to choose whether you want to see a numeric or a character coding of each categorical variable.
The harmonized surveys, with harmonized variable names and harmonized value labels, must be brought to consistent R representations (most statistical functions will only work on numeric, factor or character data) and carefully joined into a single data table for analysis.

Harmonization of variables and variable labels

The same variables may come with dissimilar variable names and variable labels. It may be a challenge to match age with age. We need to harmonize the names of variables.
The harmonized variables may have different labeling. One survey may label refused answers as declined and the other as refusal. In a simple choice, climate change may be ‘Climate change’ or ‘Problem: Climate change’. Binary choices may have survey-specific coding conventions. Value labels must be harmonized. There are good tools to do this in a single file, but we have to work with several of them.
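At its simplest, label harmonization is a lookup from every observed spelling to one agreed value. The sketch below is a bare-bones base R illustration with a hypothetical harmonize_labels() helper and a made-up lookup table; retroharmonize's own harmonize_values() does considerably more, including keeping track of missing-value types.

```r
# Hypothetical helper: map label variants to harmonized values, leave unknown labels as they are.
harmonize_labels <- function(x, lookup) {
  key <- tolower(trimws(x))
  out <- unname(lookup[key])
  unmatched <- is.na(out)
  out[unmatched] <- x[unmatched]
  out
}

label_lookup <- c(
  "refusal"                 = "declined",
  "declined"                = "declined",
  "problem: climate change" = "climate_change",
  "climate change"          = "climate_change"
)

harmonize_labels(
  c("Refusal", "Declined", "Problem: Climate change", "Climate change", "Other"),
  label_lookup
)
```

Leaving unmatched labels untouched makes it easy to list what still needs a decision, instead of silently turning them into missing values.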

Missing value harmonization

There are likely to be various types of missing values. Working with missing values is probably where most human judgment is needed. Why are some answers missing: was the question not asked in some questionnaires? Is there a coding error? Did the respondent refuse the question, or say that she did not have an answer? retroharmonize has a special labeled vector type that retains this information from the raw data, if it is present, but you must make the judgment yourself. In R, you will eventually either create a missing category, or use NA_character_ or NA_real_.

That’s a lot to put on your plate.

It is unlikely that you will be able to work with completely unfamiliar survey programs if you do not have a strong intermediate level of R. Our package comes with tutorials for Eurobarometer and Afrobarometer, and our development version already covers Arab Barometer. The tutorials highlight some peculiar issues with these survey programs, which we hope will give less experienced R users a head start.
