LIS 4/5493: Data Stewardship
Henderson (2017)
How large is the data set (or sets), and what is the expected rate of growth?
Does the data contain sensitive information?
What publications or discoveries have resulted from the data?
How should the data be made accessible?
Who owns the data?
Types of data created or captured (experimental, qualitative, modeling, etc.) and how they will be captured or processed
Contextual details about the data to make it meaningful to other users (metadata, file formatting, naming conventions)?
Storage, backup, and security plans (Where and on what media will it be stored? Backup plans for the data? How will data security be managed and by whom?)
Privacy/Protection (Ethical and privacy concerns, sensitive data, IRB issues, anonymization of data?)
Policies and restrictions related to access and re-use (gaining access to the data)?
Long-term plan for preservation and maintenance of data?
The structure of the data should be based on the project, the audience, the uses of the data, and the expected research outcomes
Ensure Data is Machine Readable
Bad [figure]
Good [figure]
Ok [figure]: could help data entry, but a .csv or .tsv copy would need to be saved.
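As a minimal sketch of what a machine-readable layout can look like, the Python below writes a small table with one header row, one record per row, and one value per cell, then reads it back. The column names and values are hypothetical and are not taken from the slides.

```python
import csv
import io

# Hypothetical example records: one observation per row, one value per cell.
records = [
    {"site": "A", "date": "2017-03-01", "temperature_c": 12.5},
    {"site": "A", "date": "2017-03-02", "temperature_c": 13.1},
    {"site": "B", "date": "2017-03-01", "temperature_c": 9.8},
]


def to_csv_text(rows):
    """Return the rows as CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["site", "date", "temperature_c"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(records)

# Reading the text back shows that software can parse it without manual cleanup.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

Writing `csv_text` to a file with a `.csv` extension would give the plain-text copy the slide calls for; passing `delimiter="\t"` to the writer would produce a .tsv instead.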
Predominant structures for organizing data are:
Spreadsheets
Mostly numeric data
Data requires calculations
Statistics and/or statistical analysis is needed
Data is non-relational
Few people need to work on the data
Databases
Large amount of data
Long strings of data in a field
Records require the use of a primary key or unique identifier
Requires relational abilities
Many people need to work with the data
Spreadsheets
Great for charts, graphs, calculations
Flexible about cell content type – cells in the same column can contain numbers or text
Lack record integrity – can sort a column independently of all others
Easy to use, but harder to maintain as the complexity and size of the data grow
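The record-integrity problem can be illustrated with a short sketch. The names and ages below are hypothetical; the two lists stand in for two spreadsheet columns whose positions line up to form records.

```python
# Hypothetical two-column sheet: each position is one record (name, age).
names = ["Cruz", "Abe", "Bo"]
ages = [41, 35, 29]

# Spreadsheet-style mistake: sort one column on its own.
sorted_names_only = sorted(names)
broken_rows = list(zip(sorted_names_only, ages))

# Record-preserving sort: keep each row together and sort the rows.
rows = list(zip(names, ages))
intact_rows = sorted(rows)
```

In `broken_rows`, Abe is paired with Cruz's age, which is the kind of silent damage that sorting a single column can cause. In `intact_rows`, every name keeps its original age.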
Databases
Easy to query to select portions of data
Data fields are typed. For example, only integers are allowed in integer fields
Columns cannot be sorted independently of each other
Steeper learning curve than a spreadsheet
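The database properties listed above (typed fields, unique identifiers, easy querying) can be sketched with Python's built-in `sqlite3` module. The table and column names are hypothetical. SQLite is used only because it needs no server; since it is lenient about column types by default, this sketch assumes a `CHECK` constraint to enforce the integer field, whereas many other database systems enforce declared types directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE specimen (
        id INTEGER PRIMARY KEY,      -- unique identifier for each record
        species TEXT NOT NULL,
        quantity INTEGER NOT NULL
            CHECK (typeof(quantity) = 'integer')  -- reject non-integer values
    )
    """
)
conn.executemany(
    "INSERT INTO specimen (species, quantity) VALUES (?, ?)",
    [("oak", 12), ("pine", 30), ("elm", 4)],
)

# Query: select only a portion of the data.
large = conn.execute(
    "SELECT species, quantity FROM specimen WHERE quantity > 10 ORDER BY quantity"
).fetchall()

# A text value in the integer field is rejected by the CHECK constraint.
try:
    conn.execute(
        "INSERT INTO specimen (species, quantity) VALUES (?, ?)",
        ("ash", "many"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Because each row is stored as a unit, `ORDER BY` reorders whole records; there is no way to sort one column independently of the others.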
There are many options for storing data, some better than others
Backing up raw data and copies of the data is essential!
Full: backup of all files
Differential incremental: backup of files changed since the last incremental or full backup
Cumulative incremental: backup of files changed since the last full backup
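The difference between the three backup types comes down to which reference time each one compares against. The sketch below assumes a hypothetical helper, `files_to_back_up`, and made-up file names and timestamps; it selects files only and does not copy anything.

```python
def files_to_back_up(mtimes, kind, last_full, last_incremental=None):
    """Return the sorted file names that a backup of the given kind would copy.

    mtimes: dict mapping file name -> last-modified time (any comparable number)
    kind: "full", "differential", or "cumulative"
    last_full: time of the most recent full backup
    last_incremental: time of the most recent incremental backup, if any
    """
    if kind == "full":
        return sorted(mtimes)
    if kind == "differential":
        # Changed since the most recent backup of any kind.
        if last_incremental is None:
            reference = last_full
        else:
            reference = max(last_full, last_incremental)
    elif kind == "cumulative":
        # Changed since the most recent full backup.
        reference = last_full
    else:
        raise ValueError(f"unknown backup kind: {kind}")
    return sorted(name for name, t in mtimes.items() if t > reference)


# Hypothetical timeline: full backup at t=100, incremental backup at t=200.
mtimes = {"raw.csv": 50, "notes.txt": 150, "clean.csv": 250}
full = files_to_back_up(mtimes, "full", last_full=100, last_incremental=200)
differential = files_to_back_up(mtimes, "differential", last_full=100, last_incremental=200)
cumulative = files_to_back_up(mtimes, "cumulative", last_full=100, last_incremental=200)
```

Here the differential incremental backup picks up only `clean.csv`, while the cumulative incremental backup also includes `notes.txt`, because that file changed after the full backup even though an earlier incremental backup already captured it.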
