Module 10: Data Stewardship and Visualization

LIS 5043: Organization of Information

Dr. Manika Lamba

Introduction

What is Data Stewardship?

  • Data Stewardship is concerned with all aspects of the creation, management, analysis, and communication of data, focusing particularly on the application of computational methods to digital data

  • Data Stewardship = Data Management + Data Curation + Data Analytics

    • Data management: Ensuring that data is organized and maintained so that it better supports analysis
    • Data curation: Ensuring that data can be efficiently and reliably found and used
    • Data analytics: Employing specific techniques to extract knowledge from data
  • It includes, among other things: acquisition and collection, modeling, workflow, provenance, validity and integrity, metadata, preservation, integration, retrieval, re-use, policy, standards, identifiers, format conversions, processing levels, supporting reproducibility, etc.

  • It includes active and on-going management of data through its lifecycle of interest and usefulness to scholarship, science, and education; curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time.
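As a small illustration of "validity and integrity," "provenance," and "metadata" from the list above, the following Python sketch records a fixity value (a checksum) for a data object at ingest and checks it again later. The record fields, the identifier `example:0001`, and the function names are hypothetical choices for this example, not a standard metadata schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest, used here as a fixity value."""
    return hashlib.sha256(data).hexdigest()


def describe(identifier: str, data: bytes, source: str) -> dict:
    """Build a minimal, hypothetical metadata record for one data object."""
    return {
        "identifier": identifier,      # illustrative local identifier
        "source": source,              # provenance: where the data came from
        "size_bytes": len(data),
        "sha256": sha256_of(data),     # integrity: fixity value at ingest
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def is_intact(record: dict, data: bytes) -> bool:
    """Check that the bytes still match the fixity value in the record."""
    return record["sha256"] == sha256_of(data)


# Made-up sample data for illustration only.
original = b"station,temp_c\nA,21.5\nB,19.0\n"
record = describe("example:0001", original, source="hypothetical field survey")

print(json.dumps(record, indent=2))
print(is_intact(record, original))               # unchanged bytes
print(is_intact(record, original + b"C,0\n"))    # altered bytes
```

Storing the checksum alongside descriptive metadata is one common way to make later integrity checks possible. Real repositories typically rely on established schemas and persistent identifier services rather than an ad hoc dictionary like this one.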

Science of… vs Practice of…

The science of data stewardship:

research and development on new methods of data management and use;
draws on mathematical and engineering methods, but also on methods from social science, law, economics, and other disciplines

The practice of data stewardship:

use and adaptation of data management methods to meet user needs and support data analytics

Values

Data analytics values: Extraction should be novel, fast, precise, accurate

Data stewardship values: Data should be efficient and reliable: findable, useable, legal (thereby supporting novelty, speed, precision, accuracy)

Importance of Data Stewardship

  • Where real-world interdisciplinary challenges are concerned, managerial & curatorial problems are acute:

Large amounts of rapidly changing data, often heterogeneous in nature and developed by different scientific communities, must be found, retrieved, authenticated, reformatted, integrated with other data and managed for effective use, and demonstrably reliable even after processing and preparation

  • Supporting analysis, discovery, and use is an enormous challenge

. . . it involves the complex management of large-scale data storage and preservation, creation of metadata and tools for retrieval and context documentation, preparation of computationally accessible documentation of provenance and workflow, conducting reliable format conversions to support new tools and applications, the management of identifiers and validity checks that accommodate format changes, the integration of related data elements from substantially different data sources, and more. . . .
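To make "reliable format conversions" and "validity checks that accommodate format changes" more concrete, here is a minimal Python sketch, assuming a small CSV table converted to JSON. Instead of hashing the file bytes (which change with the format), it hashes a canonical form of the records, so the same check can be applied before and after conversion. The function names and the provenance fields are illustrative assumptions.

```python
import csv
import hashlib
import io
import json


def read_csv_rows(text: str) -> list[dict]:
    """Parse CSV text into a list of records (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def content_digest(rows: list[dict]) -> str:
    """Digest of the records themselves, independent of file format."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def convert_csv_to_json(csv_text: str) -> tuple[str, dict]:
    """Convert CSV to JSON and return the result with a provenance note."""
    rows = read_csv_rows(csv_text)
    json_text = json.dumps(rows, indent=2)
    provenance = {
        "process": "csv-to-json",          # hypothetical process label
        "row_count": len(rows),
        "content_sha256": content_digest(rows),
    }
    return json_text, provenance


def verify_conversion(json_text: str, provenance: dict) -> bool:
    """Re-read the converted file and compare it with the provenance note."""
    rows = json.loads(json_text)
    return (
        len(rows) == provenance["row_count"]
        and content_digest(rows) == provenance["content_sha256"]
    )


# Made-up sample data for illustration only.
csv_text = "station,temp_c\nA,21.5\nB,19.0\n"
json_text, provenance = convert_csv_to_json(csv_text)
print(provenance["row_count"])
print(verify_conversion(json_text, provenance))
print(verify_conversion(json_text.replace("21.5", "99.9"), provenance))
```

The design choice here is to tie the validity check to the content rather than to a particular serialization, which is one way a check can survive a format change. It only works if the canonical form is defined carefully, for example with respect to number formatting and field order.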

Importance of Data Stewardship (Cont.)

  • Without successful data management & curation, data analysis is either not possible or prohibitively expensive and dangerously unreliable

  • Data Stewardship is the larger part of data science.

  • Not only is Data Stewardship essential for reliable, efficient analysis, but most of the cost associated with using data is, by far, in management & curation, not analysis, and most of the workforce needs are, also by far, in management & curation, not analysis.

Any data manager in industry will tell you that management & curatorial work is where they make the largest investment of money, staff, time, and effort

Broader Activities

Some of the broader activities in Data Stewardship include:

Data Stewardship: Methods of Action

  1. Analysis: To determine needs, and develop relevant data models and metadata, and reformat, correct, or update data.
  2. Documentation: To record essential information (typically via metadata)
  3. System design and implementation: To support all data curatorial activities, and to support the generation and use of data documentation and processing documentation
  4. Policy: To specify objectives, procedures, practices, and formats.
  5. Process: To ensure success and efficiency by managing the development of appropriate organizational units and roles, providing training, advocating for change, and managing curatorial activities.

Data Stewardship Workforce

There is no single occupational category for [data stewardship] and no precise mapping between knowledge and skills needed for [data stewardship] and existing professions, careers, or job titles.

The knowledge and skills required of those engaged in [data stewardship] are dynamic and highly interdisciplinary. They include an integrated understanding of computing and information science, librarianship, archival practice, and the disciplines and domains generating and using data. Additional knowledge and skills for effective [data stewardship] are emerging in response to data-driven scholarship.

Who Does Data Work?

Some professional “data” jobs:

  • Data Scientist

  • Data/Business Analyst

  • Data Wrangler

  • Data Curator

  • Data Steward

  • Data Engineer

  • … ML, AI Engineer

and “database” jobs:

  • Database Engineer

  • Database Programmer

  • Database Architect

  • Database Administrator

and "library" jobs:

  • Research Data Services Librarian

  • Research Data Steward

  • Data Librarian

  • Data Scholarship Librarian

  • Digital Humanities Librarian

  • AI Librarian

What is Data?

We are Data

We are filled with data in today’s networked society

  • through our web activity, we are assigned gender, ethnicity, class, age, education level, and potential status of parent with x no. of children (digital trace data/digital footprint/digital breadcrumbs)

  • if internet metadata identifies a user as a foreigner, then they lose the right to privacy afforded to U.S. citizens

  • who would have thought that class status, citizenship, ethnicity could be algorithmically understood?

We are Data (Cont.)

We live in a world of ubiquitous networked communication

  • the technologies that constitute the Internet are so woven into the fabric of our daily lives that, for most of us, existing without them seems unimaginable

We also live in a world of ubiquitous surveillance

  • the same technologies have helped spawn an impressive network of governmental, commercial, and unaffiliated infrastructures of mass observation and control
  • most of what we do in this world has at least the capacity to be observed, recorded, analyzed, and stored in a databank
    • HOW?
      • storage is cheap
      • computers are fast enough to analyze information both in real time and retrospectively
      • the software that mediates our daily activities can easily be configured to record and report everything it sees upstream

Data Lifecycle

Data Management Across Research Lifecycle

Data Visualization

Why Create Visualizations Generally?

Academic and Professional Organizations

Conferences: Computing and Databases

  • ACM International Conference on Information & Knowledge Management (CIKM)
  • ACM Special Interest Group on Management of Data (SIGMOD)
  • International Conference on Very Large Data Bases (VLDB)
  • IEEE International Conference on Data Engineering (ICDE)
  • ACM SIG on Computer-Human Interaction (SIGCHI)
  • International Provenance and Annotation Workshop (IPAW)
  • International Semantic Web Conference (ISWC)
  • ACM Conference on Fairness, Accountability, and Transparency (FAccT)
  • ACM Conference on Reproducibility and Replicability (REP)

Conferences: Digital Libraries, Curation, & Preservation

  • ACM Joint Conference on Digital Libraries (JCDL)
  • International Conference on Digital Preservation (iPres)
  • International Association for Social Science Information Services and Technology (IASSIST)
  • Open Repositories

Selected Journals

  • ACM Transactions on Database Systems (TODS)
  • IEEE Transactions on Knowledge and Data Engineering (TKDE)
  • CODATA Data Science Journal
  • International Journal of Digital Curation (IJDC)

“Everyone wants to do the model work, not the data work”

  • Data quality is essential in machine learning and AI
  • Data often determines model performance, fairness, safety, scalability
  • This is particularly acute in high-stakes domains
    • Health, safety, environment
  • However, data work is often undervalued and not incentivized
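As one hedged example of what routine "data work" can look like before any model is trained, the Python sketch below counts missing values and exact duplicate rows in a small table. The field names, the sample rows, and the `quality_report` function are hypothetical. Real data quality work covers far more than these two checks (for example labeling errors, bias, and coverage).

```python
from collections import Counter


def quality_report(rows: list[dict], required: list[str]) -> dict:
    """Count missing required values and exact duplicate rows."""
    missing = Counter()
    for row in rows:
        for field in required:
            value = row.get(field)
            # Treat None and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[field] += 1

    # Two rows are duplicates if every field/value pair is identical.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicate_rows": duplicates,
    }


# Made-up sample data for illustration only.
rows = [
    {"id": "1", "label": "cat", "age": "3"},
    {"id": "2", "label": "", "age": "5"},
    {"id": "3", "label": "dog", "age": None},
    {"id": "1", "label": "cat", "age": "3"},
]
report = quality_report(rows, required=["id", "label", "age"])
print(report)
```

Even a simple report like this makes data problems visible early, when they are usually cheaper to fix than after they have affected a trained model.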