LIS 4/5693: Information Retrieval and Text Mining
1. Always On
2. Non-Reactive
3. Captures Social relationships
Despite the considerable advantages of digital trace data, they also create a range of challenges for empirical observation and causal inference:
1. Inaccessible
2. Non-Representative
3. Drifting
4. Algorithmic Confounding
5. Unstructured
6. Sensitive
7. Incomplete
8. Elite Bias
You know the famous saying, “history is written by the victors”? Much digital trace data is also created by people who are elites, and who might provide selective or incomplete accounts of what is going on, or worse.
9. Positivity Bias
Finally, digital trace data often have performative dimensions. Many people do not report negative information about themselves online precisely because they know that their friends, colleagues – or other people they do not know – may be watching them. This creates another common form of bias in social media research.
Through our web activity, we are assigned a gender, ethnicity, class, age, education level, and potential status as a parent of x number of children (digital trace data, digital footprint, digital breadcrumbs).
If internet metadata identifies a user as a foreigner, then they lose the right to privacy afforded to U.S. citizens.
Who would have thought that class status, citizenship, and ethnicity could be algorithmically understood?

John Cheney-Lippold. (2017). We are Data: algorithms and the making of our digital selves. New York University Press.
A data-based attack is a ‘signature strike’
In the US drone program of the early 2000s, strikes were “targeted”
US does not publicly differentiate between its “targeted” and “signature” strikes
John Cheney-Lippold. (2017). We are Data: algorithms and the making of our digital selves. New York University Press.
It occurs when a computer system reflects the implicit values of the humans who are involved in collecting, selecting, or using data.
(Metaphors in HCI)
Lamba, M., Madhusudhan, M. (2022). Text Data and Mining Ethics. In: Text Mining for Information Professionals. Springer, Cham. https://doi.org/10.1007/978-3-030-85085-2_11
FairML, IBM AI Fairness 360, Accenture’s “Teach and Test” Methodology, Google’s What-If Tool, and Microsoft’s Fairlearn
Lamba, M., Madhusudhan, M. (2022). Text Data and Mining Ethics. In: Text Mining for Information Professionals. Springer, Cham. https://doi.org/10.1007/978-3-030-85085-2_11
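One kind of check that bias-auditing tools like those listed above can perform is comparing how often a system gives a positive outcome to different groups. The sketch below computes that gap (often called the demographic parity difference) in plain Python. It does not use the API of any tool named above; the function names and the toy predictions are assumptions made for illustration only.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Return the share of positive (1) predictions within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_difference(predictions, groups):
    """Gap between the highest and lowest group selection rates (0 means equal)."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: 1 = positive decision, 0 = negative decision.
toy_predictions = [1, 1, 0, 1, 0, 0, 0, 1]
toy_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

toy_rates = selection_rates(toy_predictions, toy_groups)
toy_gap = demographic_parity_difference(toy_predictions, toy_groups)
print(toy_rates, toy_gap)
```

A large gap does not prove a system is unfair on its own, but it flags where the collected or selected data may be carrying the implicit values described above.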
University: “Material or information on which an argument, theory, test or hypothesis, or another research output is based” (Queensland University of Technology. Manual of Procedures and Policies. Section 2.8.3.)
Digital Project Management: “What constitutes such data will be determined by the community of interest through the process of peer review and program management. This may include, but is not limited to: data, publications, samples, physical collections, software and models” (Marieke Guy)
Government Institution: “Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues” (OMB-110, Subpart C, section 36, (d) (i))
Data Science: “The short answer is that we can’t always trust empirical measures at face value: data is always biased, measurements always contain errors, systems always have confounders, and people always make assumptions” (Angela Bassa)
A small list of open multimedia formats. For a list of file formats, consider checking out the Library of Congress’ list of Sustainability of Digital Formats.

A screenshot of a dataset as a CSV file, which is uncompressed and follows an open standard

A screenshot of the same dataset in an Excel file (.xlsx). Unlike the previous image, this is a proprietary format
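To illustrate why an open, plain-text format matters, the minimal sketch below parses CSV text using only Python's standard library, with no proprietary software involved. The sample rows and the `read_csv_text` helper are hypothetical; they do not reproduce the dataset shown in the screenshots.

```python
import csv
import io

# Hypothetical sample data; the dataset from the screenshots is not reproduced here.
SAMPLE_CSV = (
    "title,year\n"
    "We Are Data,2017\n"
    "Text Mining for Information Professionals,2022\n"
)


def read_csv_text(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_csv_text(SAMPLE_CSV)
for row in rows:
    # Every value arrives as a string, because CSV stores plain text only.
    print(row["title"], row["year"])
```

Reading the .xlsx version of the same data would typically require a spreadsheet application or an additional library, which is one practical cost of a less open format.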
As you inspect the information present in each image, consider these questions:
What are some forms of data used in the project?
What are some forms of data outputted by the project?
Where was the data retrieved from to complete the project?
Human Computers at NASA is an archival project that “seeks to shed light on the buried stories of African American women with math and science degrees who began working at NACA (now NASA) in 1943 in secret, segregated facilities.”

Listen for the Iraqis in NYC! is an audio community mapping project that seeks to locate the Iraqi population in NYC using their own voices.
The Institutional Review Board (IRB) is a floor for ethical responsibility at universities that came to pass after outrage about horrific unethical research studies done on people. A prime example of these grotesque studies is the Tuskegee Syphilis Study (1932-1972).
Born from concerns about the ethical choices made in biomedical and behavioral research, IRB compliance is not broadly applicable.
This leaves holes in institutional ethical regulations and requires researchers in other fields, such as the social sciences, to find other ethical regulations or devise field-specific ethical considerations.
Usually, IRB review is required when ALL of the criteria below are met:
The investigator is conducting research or clinical investigation,
The proposed research or clinical investigation involves human subjects, and
The university or research institution is engaged in the research or clinical investigation involving human subjects.
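The three criteria above combine with a logical AND: review is usually required only when every one of them holds. The sketch below expresses that rule in Python. The `Project` class and its field names are hypothetical, and the function is a simplification for illustration, not a substitute for an institution's own determination.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """Hypothetical record of the three screening answers for a project."""
    is_research: bool              # research or clinical investigation?
    involves_human_subjects: bool  # does it involve human subjects?
    institution_engaged: bool      # is the institution engaged in it?


def irb_review_usually_required(project):
    """Return True only when ALL three criteria are met."""
    return all([
        project.is_research,
        project.involves_human_subjects,
        project.institution_engaged,
    ])


survey_study = Project(True, True, True)
archival_study = Project(True, False, True)  # no human subjects involved

print(irb_review_usually_required(survey_study))
print(irb_review_usually_required(archival_study))
```

A project that fails this check can still raise ethical questions, which is the gap in institutional regulation noted above.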
Ensuring good, useful data
Organizing and Structuring data for the user
Storing the data
Documenting and Describing the data
Analysis of the data
Publishing data sets
Sharing the data and results
Curating and Preserving good, useful data
Preparing the data for the user
Ingesting and Storing the data
Ensuring privacy and securing the data
Re-using the data

