Supporting Data Collection: Organizing, Structuring, and Storing Data

LIS 4/5493: Data Stewardship

Dr. Manika Lamba

Introduction

Organizing Data

Organizing Data (Cont.)

Henderson (2017)

  • How large is the data set(s), and what is the expected rate of growth?

  • Does the data contain sensitive information?

  • What publications or discoveries have resulted from the data?

  • How should the data be made accessible?

  • Who owns the data?

  • What types of data will be created or captured (experimental, qualitative, modeling, etc.), and how will they be captured or processed?

  • What contextual details (metadata, file formatting, naming conventions) will make the data meaningful to other users?

  • Storage, backup, and security plans (Where and on what media will it be stored? Backup plans for the data? How will data security be managed and by whom?)

  • Privacy/Protection (Ethical and privacy concerns, sensitive data, IRB issues, anonymization of data?)

  • Policies and restrictions related to access and re-use (gaining access to the data)?

  • What is the long-term plan for preservation and maintenance of the data?

Organizing Data Best Practices

Structuring Data

The structure of the data should be based on the project, the audience, the uses of the data, and the expected research outcomes.

  • Choose a structure or software that is appropriate to the type(s) of data being collected AND how it will be analyzed
  • Be aware of common issues of data structuring
  • Adhere to best practices for using raw and processed data files
  • Document any processing done to files, copies made, new files developed, etc.
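One way to keep raw files untouched while documenting copies is sketched below. It is a minimal illustration, not a prescribed workflow: the folder names (`raw`, `processed`), the log file name (`processing_log.jsonl`), and the helper `make_working_copy` are all hypothetical choices made for this example.

```python
import hashlib
import json
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def make_working_copy(raw_file: Path, processed_dir: Path, log_file: Path, note: str) -> Path:
    """Copy a raw file into processed_dir and append a record of the step to log_file."""
    processed_dir.mkdir(parents=True, exist_ok=True)
    working_copy = processed_dir / raw_file.name
    shutil.copy2(raw_file, working_copy)  # the raw file itself is never modified
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": str(raw_file),
        "copy": str(working_copy),
        "sha256_of_source": sha256_of(raw_file),
        "note": note,
    }
    # One JSON object per line, so the log can be appended to and parsed later.
    with log_file.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return working_copy


# Demonstration in a throwaway directory (hypothetical project layout).
project = Path(tempfile.mkdtemp())
raw_dir = project / "raw"
raw_dir.mkdir()
raw_file = raw_dir / "survey.csv"
raw_file.write_text("id,score\n1,4\n2,5\n", encoding="utf-8")

copy_path = make_working_copy(
    raw_file,
    project / "processed",
    project / "processing_log.jsonl",
    "Initial working copy; no changes yet",
)
```

All later cleaning or analysis would then be done on the copy in `processed`, with one more log entry per step. The recorded checksum makes it possible to confirm later that the raw file has not changed.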

Common Issues in Data Structuring

Examples

Ensure Data is Machine Readable

Bad

Good

Ok

  • Could help with data entry

  • A .csv or .tsv copy would need to be saved
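The slide examples are images and are not reproduced here, so the following is only an assumed illustration of a machine-readable layout: a single header row, one observation per row, and plain values with no formatting that carries meaning. The column names and values are invented for the example. It shows how such a table can be saved as .csv text and read back with Python's standard `csv` module.

```python
import csv
import io

# Hypothetical observations: one dictionary per row, same keys in every row.
rows = [
    {"site": "A", "date": "2024-03-01", "temperature_c": 12.5},
    {"site": "A", "date": "2024-03-02", "temperature_c": 13.1},
    {"site": "B", "date": "2024-03-01", "temperature_c": 9.8},
]


def to_csv_text(records, fieldnames):
    """Serialize records as CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv_text(rows, ["site", "date", "temperature_c"])

# Reading the text back needs no knowledge of the software that produced it.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

Note that a CSV reader returns every value as text, so numeric columns need to be converted (for example with `float`) before analysis.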

Choosing Structure/Software

Predominant structures for organizing data are:

  • Spreadsheets
  • Databases
  • Text files
  • Instrument-dependent data outputs that may be spreadsheets or databases

Spreadsheets vs. Databases

Spreadsheets

  • Mostly numeric data

  • Data requires calculations

  • Statistics and/or statistical analysis is needed

  • Data is non-relational

  • Few people need to work on the data

Databases

  • Large amount of data

  • Long strings of data in a field

  • Records require the use of a primary key or unique identifier

  • Requires relational abilities

  • Many people need to work with the data

Spreadsheets vs. Databases (Cont.)

Spreadsheets

  • Great for charts, graphs, calculations

  • Flexible about cell content type – cells in same column can contain numbers or text

  • Lack record integrity – can sort a column independently of all others

  • Easy to use – but harder to maintain as complexity and size of data grows
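The record-integrity point can be illustrated with a small sketch. The names and ages are invented, and plain Python lists stand in for spreadsheet columns; this is an analogy rather than a model of any particular spreadsheet program.

```python
# Two "columns" of a hypothetical spreadsheet; row i is (names[i], ages[i]).
names = ["Chen", "Ali", "Baker"]
ages = [41, 29, 35]

# Sorting one column on its own, as a spreadsheet permits, breaks the rows:
# each name is now paired with someone else's age.
sorted_names_only = sorted(names)
broken_rows = list(zip(sorted_names_only, ages))

# Sorting whole records keeps each name together with its own age.
records = list(zip(names, ages))
intact_rows = sorted(records)
```

Here `broken_rows` pairs "Ali" with 41, although Ali's age in the original rows is 29, while `intact_rows` preserves every original pairing.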

Databases

  • Easy to query to select portions of data

  • Data fields are typed. For example, only integers are allowed in integer fields

  • Columns cannot be sorted independently of each other

  • Steeper learning curve than a spreadsheet
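The database properties listed above (querying a portion of the data, typed fields, and a primary key) can be sketched with Python's built-in `sqlite3` module. The table and column names are hypothetical. One caveat: SQLite is loosely typed by default, so this sketch adds a `CHECK` constraint to enforce the integer field; many other database systems enforce column types without one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database
conn.execute(
    """
    CREATE TABLE samples (
        sample_id INTEGER PRIMARY KEY,  -- unique identifier for each record
        site TEXT NOT NULL,
        specimen_count INTEGER NOT NULL
            CHECK (typeof(specimen_count) = 'integer')
    )
    """
)
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [(1, "A", 12), (2, "B", 7), (3, "A", 30)],
)

# Query: select only the portion of the data collected at site A.
site_a = conn.execute(
    "SELECT sample_id, specimen_count FROM samples WHERE site = ? ORDER BY sample_id",
    ("A",),
).fetchall()


def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO samples VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_key_accepted = try_insert((1, "C", 5))         # sample_id 1 already exists
text_in_integer_accepted = try_insert((4, "C", "many"))  # text in an integer field
```

Both problem rows are rejected by the database itself, whereas a spreadsheet would accept either without complaint.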

Structuring Data

Storing Data

There are many options for storing data, some better than others

  • Personal laptops and computers
  • Network storage
  • External storage devices
  • Removable storage devices
  • Remote storage
  • Physical storage
  • Cloud storage
  • LOCKSS (Lots of Copies Keep Stuff Safe)

Storing Data (Cont.)

Backing up raw data and copies of data is essential!

  • Full backup: all files

  • Differential incremental backup: files changed since the last incremental or full backup

  • Cumulative incremental backup: files changed since the last full backup
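The difference between the three backup types comes down to which cutoff is used when selecting files. The sketch below follows the definitions above. The `FileInfo` class, the function `files_to_back_up`, and the use of day numbers in place of real timestamps are assumptions made for illustration, not part of any backup tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    name: str
    modified: int  # hypothetical day number of the last change


def files_to_back_up(files, kind, last_full, last_backup):
    """Select file names for a backup.

    last_full   -- day of the most recent full backup
    last_backup -- day of the most recent backup of any kind
    """
    if kind == "full":
        return [f.name for f in files]
    if kind == "differential":
        cutoff = last_backup  # changed since last incremental or full backup
    elif kind == "cumulative":
        cutoff = last_full    # changed since last full backup
    else:
        raise ValueError(f"unknown backup kind: {kind}")
    return [f.name for f in files if f.modified > cutoff]


# Timeline: full backup on day 2, an incremental backup on day 5.
files = [
    FileInfo("a.csv", modified=1),
    FileInfo("b.csv", modified=4),
    FileInfo("c.csv", modified=6),
]

full = files_to_back_up(files, "full", last_full=2, last_backup=5)
differential = files_to_back_up(files, "differential", last_full=2, last_backup=5)
cumulative = files_to_back_up(files, "cumulative", last_full=2, last_backup=5)
```

In this timeline the differential incremental backup copies only `c.csv`, while the cumulative incremental backup copies both `b.csv` and `c.csv`. The trade-off is a smaller backup now versus fewer backup sets to combine when restoring.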

Going Forward