DataSHIELD upload tools

Background

The aim of the EU Child Cohort Network is to bring together data from existing child cohorts into one open and sustainable, multi-disciplinary network. To ensure that the EU Child Cohort Network is both open and sustainable, the consortia use the data-sharing platform DataSHIELD.

Participating cohorts first harmonise their data, following the data harmonisation manuals. Second, they perform quality-control checks on the harmonised data. Third, they upload descriptions of the harmonisation to the cohort catalogue. The last step is uploading the data into the DataSHIELD backends. This guide walks you through the process.

Installation

You can install the package by executing the following commands:

Armadillo 3

dsUpload Version 5.x.x is compatible with Armadillo 3. You should be able to install dsUpload without specifying any additional versions.

Step 1: install devtools

install.packages("devtools")

Step 2: load devtools and install ds-upload

library(devtools)
devtools::install_github("lifecycle-project/ds-upload")

Armadillo 2

dsUpload version 4.7.x is compatible with Armadillo 2. When you install this version of dsUpload, the install.packages command might pull in the newest version of MolgenisArmadillo, which is incompatible. Run these commands (for example in RStudio) to install the correct version of MolgenisArmadillo:

Install devtools:

install.packages("devtools")

Load devtools and install ds-upload 4.7.1:

library(devtools)
devtools::install_github("lifecycle-project/ds-upload@4.7.1")

You might get the following error message:

namespace ‘MolgenisArmadillo’ 2.0.0 is being loaded, but == 1.1.3 is required

To fix this you need to remove the incompatible version of MolgenisArmadillo:

unloadNamespace("MolgenisArmadillo")
remove.packages("MolgenisArmadillo")

You might have to install these additional packages:

install.packages(c("aws.iam", "aws.s3"))

Next, install the previous version of MolgenisArmadillo (1.1.3):

packageurl <- "https://cran.r-project.org/src/contrib/Archive/MolgenisArmadillo/MolgenisArmadillo_1.1.3.tar.gz"
install.packages(packageurl, repos=NULL, type="source")

Now we (again) install dsUpload 4.7.1:

devtools::install_github("lifecycle-project/ds-upload@4.7.1")

Make sure you do NOT update MolgenisArmadillo to a version other than 1.1.3. If prompted, select option 3:

Downloading GitHub repo lifecycle-project/ds-upload@4.7.1
These packages have more recent versions available.
It is recommended to update all of them.
Which would you like to update?

1: All                                 
2: CRAN packages only                  
3: None                                
4: MolgenisA... (1.1.3 -> 2.0.0) [CRAN]

Enter one or more numbers, or an empty line to skip updates: 3
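If you prefer not to answer this prompt by hand, the upgrade argument of install_github can suppress it. The sketch below assumes devtools 2.x, where install_github comes from the remotes package and accepts upgrade = "never". The wrapper name install_dsupload_4_7_1 is made up for this example and is not part of dsUpload.

```r
# Hypothetical wrapper (not part of dsUpload): installs the pinned release
# without offering to update dependencies such as MolgenisArmadillo.
install_dsupload_4_7_1 <- function() {
  devtools::install_github(
    "lifecycle-project/ds-upload@4.7.1",
    upgrade = "never"  # skip the "Which would you like to update?" prompt
  )
}

# Call it once MolgenisArmadillo 1.1.3 is in place:
# install_dsupload_4_7_1()
```

Wrapping the call in a function keeps the snippet safe to source; nothing is installed until you call it.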

After that you should be able to load dsUpload without any problems.

If you are asked to update MolgenisArmadillo to version 2.0.x, skip the update; dsUpload 4.7.x needs MolgenisArmadillo 1.1.3 to work with Armadillo 2.
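The compatibility rules above (dsUpload 5.x.x with Armadillo 3, dsUpload 4.7.x with Armadillo 2) can be checked in code before you start an upload. This is a minimal sketch using base R only; the function name is_dsupload_compatible is an assumption made for illustration and is not provided by dsUpload.

```r
# Hypothetical helper: TRUE when a dsUpload version string matches the
# Armadillo major version according to the rules in this guide.
is_dsupload_compatible <- function(dsupload_version, armadillo_major) {
  v <- package_version(dsupload_version)
  if (armadillo_major == 3) {
    v$major == 5                      # dsUpload 5.x.x
  } else if (armadillo_major == 2) {
    v$major == 4 && v$minor == 7      # dsUpload 4.7.x
  } else {
    stop("This guide only covers Armadillo 2 and 3")
  }
}

is_dsupload_compatible("5.0.0", 3)
is_dsupload_compatible("4.7.1", 2)

# With dsUpload installed, check the real version like this:
# is_dsupload_compatible(as.character(utils::packageVersion("dsUpload")), 3)
```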

Run dsUpload

Load the package before you use it.

# load the package
library(dsUpload)
#> Loading required package: DSI
#> Loading required package: progress
#> Loading required package: R6

Troubleshooting

Check: troubleshooting guide

Usage

To simplify uploading and importing, this package imports the data dictionaries and uploads the data in one run. When running the package, you need to specify the data dictionary version and the data input file. When you use a data format other than CSV, you need to specify the data format as well.

Prerequisites

  • Upload core variables

    Merge all the variables that are contained in the dictionary of the core variables. In general, this means merging the data of WP1 and WP3 into one set.

  • Upload outcome variables

    Merge all the variables that are contained in the dictionary of the outcome variables. In general, this means merging the data of WP4, WP5 and WP6 into one set.
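As an illustration of the merge step, here is a minimal sketch in base R. It assumes both work-package files share an identifier column called child_id; that column name, the example variables and the data values are all hypothetical and should be replaced with your own.

```r
# Hypothetical WP1 and WP3 extracts that share the identifier child_id.
wp1 <- data.frame(child_id = c(1, 2, 3), sex = c(1, 2, 1))
wp3 <- data.frame(child_id = c(2, 3, 4), green_dist = c(120, 80, 300))

# Full outer join: keep every child that appears in either set.
core <- merge(wp1, wp3, by = "child_id", all = TRUE)

# Write the merged set to a CSV file that can serve as data_input_path.
core_path <- tempfile(fileext = ".csv")
write.csv(core, core_path, row.names = FALSE)
```

Using all = TRUE keeps children who are missing from one of the sets; their variables from that set become NA. Whether that is the right choice depends on your cohort's data.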

Using Armadillo

Please follow the instructions below to upload the core and outcome variables into Armadillo.

Login block Armadillo 3:

login_data <- data.frame(
  server = "https://armadillo3.test.molgenis.org",
  driver = "ArmadilloDriver")

Login block Armadillo 2:

login_data <- data.frame(
  server = "https://armadillo.test.molgenis.org", 
  storage = "https://armadillo-minio.test.molgenis.org", 
  driver = "ArmadilloDriver")

Log in to the Armadillo server:

# login to the Armadillo server
du.login(login_data = login_data)
#> ***********************************************************************************
#>   [WARNING] You are not running the latest version of the dsUpload-package.
#>   [WARNING] If you want to upgrade to newest version : [ 4.0.6 ],
#>   [WARNING] Please run 'install.packages("dsUpload", repos = "https://registry.molgenis.org/repository/R/")'
#>   [WARNING] Check the release notes here: https://github.com/lifecycle-project/analysis-protocols/releases/tag/4.0.6
#> ***********************************************************************************
#>   Login to: "https://armadillo.test.molgenis.org"
#> [1] "We're opening a browser so you can log in with code GFL6Q6"
#>   Logged on to: "https://armadillo.test.molgenis.org"
# upload the data into the DataSHIELD backend
# these are the core variables
# be advised the default input format is 'CSV'
# you can use STATA, SPSS, SAS, CSV's or R as source files
du.upload(
  cohort_id = 'gecko', 
  dict_version = '2_1', 
  dict_kind = 'core', 
  data_version = '1_0', 
  data_input_format = 'CSV',
  data_input_path = 'https://github.com/lifecycle-project/ds-upload/blob/master/inst/examples/data/WP1/data/all_measurements_v1_2.csv?raw=true',
  run_mode = "non_interactive"
)
#> ***********************************************************************************
#>   [WARNING] You are not running the latest version of the dsUpload-package.
#>   [WARNING] If you want to upgrade to newest version : [ 4.0.6 ],
#>   [WARNING] Please run 'install.packages("dsUpload", repos = "https://registry.molgenis.org/repository/R/")'
#>   [WARNING] Check the release notes here: https://github.com/lifecycle-project/analysis-protocols/releases/tag/4.0.6
#> ***********************************************************************************
#> ######################################################
#>   Start upload data into DataSHIELD backend
#> ------------------------------------------------------
#>  * Create temporary workdir
#> ######################################################
#>   Start download dictionaries
#> ------------------------------------------------------
#> * Download: [ 2_1_monthly_rep.xlsx ]
#> * Download: [ 2_1_non_rep.xlsx ]
#> * Download: [ 2_1_trimester_rep.xlsx ]
#> * Download: [ 2_1_yearly_rep.xlsx ]
#>   Successfully downloaded dictionaries
#> ######################################################
#>   Start importing data dictionaries
#> ######################################################
#>  * Check released dictionaries
#> * Project : gecko already exists
#> ######################################################
#>   Start converting and uploading data
#> ######################################################
#> * Setup: load data and set output directory
#> ------------------------------------------------------
#> [WARNING] This is an unmatched column, it will be dropped : [ art ].
#> * Generating: non-repeated measures
#> * Generating: yearly-repeated measures
#> Aggregate function missing, defaulting to 'length'
#> * Generating: monthly-repeated measures
#> Aggregate function missing, defaulting to 'length'
#> * Generating: trimesterly-repeated measures
#> Aggregate function missing, defaulting to 'length'
#> * Start importing: 2_1_core_1_0 into project: gecko
#> Compressing...
#> 
  |                                                                              
  |                                                                        |   0%
  |                                                                              
  |========================================================================| 100%
#> Uploaded 2_1_core_1_0/trimester
#> * Import finished successfully
#> * Start importing: 2_1_core_1_0 into project: gecko
#> Compressing...
#> 
  |                                                                              
  |                                                                        |   0%
  |                                                                              
  |==========                                                              |  15%
  |                                                                              
  |=====================                                                   |  29%
  |                                                                              
  |===============================                                         |  44%
  |                                                                              
  |==========================================                              |  58%
  |                                                                              
  |====================================================                    |  73%
  |                                                                              
  |===============================================================         |  87%
  |                                                                              
  |========================================================================| 100%
#> Uploaded 2_1_core_1_0/non_rep
#> * Import finished successfully
#> * Start importing: 2_1_core_1_0 into project: gecko
#> Compressing...
#> 
  |                                                                              
  |                                                                        |   0%
  |                                                                              
  |========================================================================| 100%
#> Uploaded 2_1_core_1_0/yearly_rep
#> * Import finished successfully
#> * Start importing: 2_1_core_1_0 into project: gecko
#> Compressing...
#> 
  |                                                                              
  |                                                                        |   0%
  |                                                                              
  |========================================================================| 100%
#> Uploaded 2_1_core_1_0/monthly_rep
#> * Import finished successfully
#> ######################################################
#>   Converting and import successfully finished
#> ######################################################
#>  * Reinstate default working directory
#>  * Cleanup temporary directory
# upload the outcome variables
du.upload(
  cohort_id = 'gecko', 
  dict_version = '1_1', 
  dict_kind = 'outcome', 
  data_version = '1_0', 
  data_input_format = 'CSV',
  data_input_path = 'https://github.com/lifecycle-project/ds-upload/blob/master/inst/examples/data/WP6/nd_data_wp6.csv?raw=true',
  run_mode = "non_interactive"
)
#> ***********************************************************************************
#>   [WARNING] You are not running the latest version of the dsUpload-package.
#>   [WARNING] If you want to upgrade to newest version : [ 4.0.6 ],
#>   [WARNING] Please run 'install.packages("dsUpload", repos = "https://registry.molgenis.org/repository/R/")'
#>   [WARNING] Check the release notes here: https://github.com/lifecycle-project/analysis-protocols/releases/tag/4.0.6
#> ***********************************************************************************
#> ######################################################
#>   Start upload data into DataSHIELD backend
#> ------------------------------------------------------
#>  * Create temporary workdir
#> ######################################################
#>   Start download dictionaries
#> ------------------------------------------------------
#> * Download: [ 1_1_monthly_rep.xlsx ]
#> * Download: [ 1_1_non_rep.xlsx ]
#> * Download: [ 1_1_weekly_rep.xlsx ]
#> * Download: [ 1_1_yearly_rep.xlsx ]
#>   Successfully downloaded dictionaries
#> ######################################################
#>   Start importing data dictionaries
#> ######################################################
#>  * Check released dictionaries
#> * Project : gecko already exists
#> ######################################################
#>   Start converting and uploading data
#> ######################################################
#> * Setup: load data and set output directory
#> ------------------------------------------------------
#> * Generating: non-repeated measures
#> * Generating: yearly-repeated measures
#> * Generating: monthly-repeated measures
#> * Generating: weekly-repeated measures
#> * Start importing: 1_1_outcome_1_0 into project: gecko
#> Compressing...
#> 
  |                                                                              
  |                                                                        |   0%
  |                                                                              
  |========================================================================| 100%
#> Uploaded 1_1_outcome_1_0/weekly_rep
#> * Import finished successfully
#> * Start importing: 1_1_outcome_1_0 into project: gecko
#> Compressing...
#> 
  |                                                                              
  |                                                                        |   0%
  |                                                                              
  |========================================================================| 100%
#> Uploaded 1_1_outcome_1_0/non_rep
#> * Import finished successfully
#> * Start importing: 1_1_outcome_1_0 into project: gecko
#> Compressing...
#> 
  |                                                                              
  |                                                                        |   0%
  |                                                                              
  |======                                                                  |   8%
  |                                                                              
  |===========                                                             |  16%
  |                                                                              
  |=================                                                       |  24%
  |                                                                              
  |=======================                                                 |  31%
  |                                                                              
  |============================                                            |  39%
  |                                                                              
  |==================================                                      |  47%
  |                                                                              
  |========================================                                |  55%
  |                                                                              
  |=============================================                           |  63%
  |                                                                              
  |===================================================                     |  71%
  |                                                                              
  |=========================================================               |  79%
  |                                                                              
  |==============================================================          |  86%
  |                                                                              
  |====================================================================    |  94%
  |                                                                              
  |========================================================================| 100%
#> Uploaded 1_1_outcome_1_0/yearly_rep
#> * Import finished successfully
#> * Start importing: 1_1_outcome_1_0 into project: gecko
#> Compressing...
#> 
  |                                                                              
  |                                                                        |   0%
  |                                                                              
  |========================================================================| 100%
#> Uploaded 1_1_outcome_1_0/monthly_rep
#> * Import finished successfully
#> ######################################################
#>   Converting and import successfully finished
#> ######################################################
#>  * Reinstate default working directory
#>  * Cleanup temporary directory
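The log above warns that the unmatched column art is dropped during the upload. If you want to spot such columns before uploading, a comparison of column names is enough. The sketch below is an assumption-laden illustration: the dictionary variable names and the data are invented, and find_unmatched_columns is not a dsUpload function.

```r
# Hypothetical dictionary variable names; in practice these come from the
# dictionary files for your dict_version and dict_kind.
dictionary_vars <- c("child_id", "mother_id", "sex", "coh_country")

# Hypothetical harmonised data with one column the dictionary does not know.
harmonised <- data.frame(child_id = 1:3, sex = c(1, 2, 1), art = c(0, 0, 1))

# Columns present in the data but absent from the dictionary.
find_unmatched_columns <- function(data, dictionary_vars) {
  setdiff(names(data), dictionary_vars)
}

unmatched <- find_unmatched_columns(harmonised, dictionary_vars)
if (length(unmatched) > 0) {
  message("Unmatched columns: ", paste(unmatched, collapse = ", "))
}
```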

Using Opal

A video guiding you through the process can be found here:

Check youtube channel: upload data dictionaries and data into Opal

Alternatively, execute these commands in your R-console:

login_data <- data.frame(
  server = "https://opal.edge.molgenis.org", 
  user = "administrator", 
  password = "ouf0uPh6",
  driver = "OpalDriver")
# login to the DataSHIELD backend
du.login(login_data = login_data)
#> ***********************************************************************************
#>   [WARNING] You are not running the latest version of the dsUpload-package.
#>   [WARNING] If you want to upgrade to newest version : [ 4.0.6 ],
#>   [WARNING] Please run 'install.packages("dsUpload", repos = "https://registry.molgenis.org/repository/R/")'
#>   [WARNING] Check the release notes here: https://github.com/lifecycle-project/analysis-protocols/releases/tag/4.0.6
#> ***********************************************************************************
#>   Login to: "https://opal.edge.molgenis.org"
#>   Logged on to: "https://opal.edge.molgenis.org"
# upload the data into the DataSHIELD backend
# these are the core variables
# be advised the default input format is 'CSV'
# you can use STATA, SPSS, SAS and CSV's as source files
du.upload(
  cohort_id = 'gecko', 
  dict_version = '2_1', 
  dict_kind = 'core', 
  data_version = '1_0', 
  data_input_format = 'CSV',
  data_input_path = 'https://github.com/lifecycle-project/ds-upload/blob/master/inst/examples/data/WP1/data/all_measurements_v1_2.csv?raw=true',
  run_mode = "non_interactive"
)
#> ***********************************************************************************
#>   [WARNING] You are not running the latest version of the dsUpload-package.
#>   [WARNING] If you want to upgrade to newest version : [ 4.0.6 ],
#>   [WARNING] Please run 'install.packages("dsUpload", repos = "https://registry.molgenis.org/repository/R/")'
#>   [WARNING] Check the release notes here: https://github.com/lifecycle-project/analysis-protocols/releases/tag/4.0.6
#> ***********************************************************************************
#> ######################################################
#>   Start upload data into DataSHIELD backend
#> ------------------------------------------------------
#>  * Create temporary workdir
#> ######################################################
#>   Start download dictionaries
#> ------------------------------------------------------
#> * Download: [ 2_1_monthly_rep.xlsx ]
#> * Download: [ 2_1_non_rep.xlsx ]
#> * Download: [ 2_1_trimester_rep.xlsx ]
#> * Download: [ 2_1_yearly_rep.xlsx ]
#>   Successfully downloaded dictionaries
#> ######################################################
#>   Start importing data dictionaries
#> ######################################################
#>  * Check released dictionaries
#> ------------------------------------------------------
#>   Start creating project: [ lc_gecko_core_2_1 ]
#> * Project: [ lc_gecko_core_2_1 ] already exists
#> ------------------------------------------------------
#>   Start importing dictionaries
#> * Table: [ 1_0_monthly_rep ] already exists
#> * Import variables into: [ 1_0_monthly_rep ]
#> * Table: [ 1_0_non_rep ] already exists
#> * Matched categories for table: [ 1_0_non_rep ]
#> * Import variables into: [ 1_0_non_rep ]
#> * Table: [ 1_0_trimester_rep ] already exists
#> * Matched categories for table: [ 1_0_trimester_rep ]
#> * Import variables into: [ 1_0_trimester_rep ]
#> * Table: [ 1_0_yearly_rep ] already exists
#> * Matched categories for table: [ 1_0_yearly_rep ]
#> * Import variables into: [ 1_0_yearly_rep ]
#>   All dictionaries are populated correctly
#> ######################################################
#>   Start converting and uploading data
#> ######################################################
#> * Setup: load data and set output directory
#> ------------------------------------------------------
#> [WARNING] This is an unmatched column, it will be dropped : [ art ].
#> * Generating: non-repeated measures
#> * Generating: yearly-repeated measures
#> Aggregate function missing, defaulting to 'length'
#> * Generating: monthly-repeated measures
#> Aggregate function missing, defaulting to 'length'
#> * Generating: trimesterly-repeated measures
#> Aggregate function missing, defaulting to 'length'
#> * Upload: [ 2021-01-29_11-40-58_1_0_trimester_repeated_measures.csv ] to directory [ core ]
#> * Upload: [ 2021-01-29_11-40-58_1_0_non_repeated_measures.csv ] to directory [ core ]
#> * Upload: [ 2021-01-29_11-40-58_1_0_yearly_repeated_measures.csv ] to directory [ core ]
#> * Upload: [ 2021-01-29_11-40-58_1_0_monthly_repeated_measures.csv ] to directory [ core ]
#> ######################################################
#>   Converting and import successfully finished
#> ######################################################
#>  * Reinstate default working directory
#>  * Cleanup temporary directory
# upload the outcome variables
du.upload(
  cohort_id = 'gecko', 
  dict_version = '1_1', 
  dict_kind = 'outcome', 
  data_version = '1_0', 
  data_input_format = 'CSV',
  data_input_path = 'https://github.com/lifecycle-project/ds-upload/blob/master/inst/examples/data/WP6/nd_data_wp6.csv?raw=true',
  run_mode = "non_interactive"
)
#> ***********************************************************************************
#>   [WARNING] You are not running the latest version of the dsUpload-package.
#>   [WARNING] If you want to upgrade to newest version : [ 4.0.6 ],
#>   [WARNING] Please run 'install.packages("dsUpload", repos = "https://registry.molgenis.org/repository/R/")'
#>   [WARNING] Check the release notes here: https://github.com/lifecycle-project/analysis-protocols/releases/tag/4.0.6
#> ***********************************************************************************
#> ######################################################
#>   Start upload data into DataSHIELD backend
#> ------------------------------------------------------
#>  * Create temporary workdir
#> ######################################################
#>   Start download dictionaries
#> ------------------------------------------------------
#> * Download: [ 1_1_monthly_rep.xlsx ]
#> * Download: [ 1_1_non_rep.xlsx ]
#> * Download: [ 1_1_weekly_rep.xlsx ]
#> * Download: [ 1_1_yearly_rep.xlsx ]
#>   Successfully downloaded dictionaries
#> ######################################################
#>   Start importing data dictionaries
#> ######################################################
#>  * Check released dictionaries
#> ------------------------------------------------------
#>   Start creating project: [ lc_gecko_outcome_1_1 ]
#> * Project: [ lc_gecko_outcome_1_1 ] already exists
#> ------------------------------------------------------
#>   Start importing dictionaries
#> * Table: [ 1_0_monthly_rep ] already exists
#> * Matched categories for table: [ 1_0_monthly_rep ]
#> * Import variables into: [ 1_0_monthly_rep ]
#> * Table: [ 1_0_non_rep ] already exists
#> * Matched categories for table: [ 1_0_non_rep ]
#> * Import variables into: [ 1_0_non_rep ]
#> * Table: [ 1_0_weekly_rep ] already exists
#> * Import variables into: [ 1_0_weekly_rep ]
#> * Table: [ 1_0_yearly_rep ] already exists
#> * Matched categories for table: [ 1_0_yearly_rep ]
#> * Import variables into: [ 1_0_yearly_rep ]
#>   All dictionaries are populated correctly
#> ######################################################
#>   Start converting and uploading data
#> ######################################################
#> * Setup: load data and set output directory
#> ------------------------------------------------------
#> * Generating: non-repeated measures
#> * Generating: yearly-repeated measures
#> * Generating: monthly-repeated measures
#> * Generating: weekly-repeated measures
#> * Upload: [ 2021-01-29_11-41-21_1_0_weekly_repeated_measures.csv ] to directory [ outcome ]
#> * Upload: [ 2021-01-29_11-41-21_1_0_non_repeated_measures.csv ] to directory [ outcome ]
#> * Upload: [ 2021-01-29_11-41-21_1_0_yearly_repeated_measures.csv ] to directory [ outcome ]
#> * Upload: [ 2021-01-29_11-41-21_1_0_monthly_repeated_measures.csv ] to directory [ outcome ]
#> ######################################################
#>   Converting and import successfully finished
#> ######################################################
#>  * Reinstate default working directory
#>  * Cleanup temporary directory
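The Opal login block above contains the password in plain text. One way to keep credentials out of scripts is to read them from environment variables. The sketch below assumes variables named OPAL_USER and OPAL_PASSWORD; those names and the helper build_opal_login are made up for this example, while the data frame columns are the ones used in the login block above.

```r
# Hypothetical helper: builds the login data frame from environment
# variables so that no password ends up in the script.
build_opal_login <- function(server,
                             user = Sys.getenv("OPAL_USER"),
                             password = Sys.getenv("OPAL_PASSWORD")) {
  if (!nzchar(user) || !nzchar(password)) {
    stop("Set OPAL_USER and OPAL_PASSWORD before logging in")
  }
  data.frame(
    server = server,
    user = user,
    password = password,
    driver = "OpalDriver"
  )
}

# Usage, after setting the variables (for example in ~/.Renviron):
# login_data <- build_opal_login("https://opal.edge.molgenis.org")
# du.login(login_data = login_data)
```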

IMPORTANT: You can run this package for the core variables and for the outcome variables. Each requires changing some parameters in the function call: dict_kind specifies ‘core’ or ‘outcome’ variables, and dict_version specifies the data dictionary version (check the changelogs here: https://github.com/lifecycle-project/ds-dictionaries/tree/master/changelogs).
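The parameters also show up in the names that appear in the example logs: the Opal project lc_gecko_core_2_1 and the Armadillo folder 2_1_core_1_0. The sketch below reproduces that pattern so you know what to look for after an upload. The pattern is inferred from the log output in this guide, not taken from the package source, and both function names are hypothetical.

```r
# Opal project name as seen in the logs: lc_<cohort_id>_<dict_kind>_<dict_version>
opal_project_name <- function(cohort_id, dict_kind, dict_version) {
  paste("lc", cohort_id, dict_kind, dict_version, sep = "_")
}

# Armadillo folder name as seen in the logs: <dict_version>_<dict_kind>_<data_version>
armadillo_folder_name <- function(dict_version, dict_kind, data_version) {
  paste(dict_version, dict_kind, data_version, sep = "_")
}

opal_project_name("gecko", "core", "2_1")        # "lc_gecko_core_2_1"
armadillo_folder_name("2_1", "core", "1_0")      # "2_1_core_1_0"
```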

IMPORTANT: You can specify your upload format, so you do not have to export to CSV first. Supported upload formats are: ‘SPSS’, ‘SAS’, ‘STATA’ and ‘CSV’.
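If you upload files of different types, you can derive the data_input_format value from the file extension. This is a minimal sketch: the format strings are the ones listed above, the extension mapping reflects the usual file extensions of those tools, and guess_input_format is a hypothetical helper, not part of dsUpload.

```r
# Hypothetical helper: map a local file's extension to one of the
# supported data_input_format values listed in this guide.
guess_input_format <- function(path) {
  ext <- tolower(tools::file_ext(path))
  formats <- c(csv = "CSV", sav = "SPSS", sas7bdat = "SAS", dta = "STATA")
  if (!ext %in% names(formats)) {
    stop("Unsupported file extension: '", ext, "'")
  }
  unname(formats[ext])
}

guess_input_format("core_variables.dta")   # "STATA"
guess_input_format("core_variables.sav")   # "SPSS"
```

The helper only looks at the end of the path, so it does not work for URLs with a query string such as ?raw=true; pass the format explicitly in that case.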

Import the data

If you run these commands, your data will be uploaded to the DataSHIELD backend. If you use Opal, you can now import these data into the tables manually.

A video guiding you through the process can be found here: import data into Opal

Alternatively, execute these actions for Opal:

  1. Navigate to the Opal web interface
  2. Log in with your credentials
  3. Select “Projects”
  4. Select “lc_#cohort#_#dict_kind#_x_x”
  5. Select “Import”
  6. Choose the CSV-file
  7. Select the target table (depending on your choice regarding the file you have chosen)
  8. Click on “Next”
  9. Click on “Next”
  10. Check that all variables are matched; otherwise you need to fix your source data first

      IMPORTANT: make sure no NEW variables are introduced

  11. Click on “Finish”
  12. Check the “Task logs” (on the left side of the screen, in the icon bar)

Opal matches your data against the data dictionary and reports which variables are matched and which are not. You can re-upload the source files as often as needed.