Acquiring Text

LIS 4/5693: Information Retrieval and Text Mining

Dr. Manika Lamba

Introduction

Digital Trace Data

Over the past decade, the volume of digital data produced on the internet that describes human behavior and other objects of scholarly inquiry has grown rapidly. As this figure shows, recent decades have witnessed not only an increase in the amount of text-based data but also an increase in the computing power needed to analyze it. Together, these two shifts hold the potential to significantly expand the scope of research in many different fields.

We define digital trace data, also referred to as a digital footprint or digital breadcrumbs, as the unprecedented amount of data generated from blogs, web search data, social media sites, other Internet forums, administrative data on sites, Internet archives, digitized text archives, audiovisual data, telecommunication, or geospatial data.

This voluminous amount of digital data generated on the Internet helps map human behavior and how we interact in different groups of communities and societies. It also opens the gates to perform various analytics that can then be used to do some social good.

In the next few slides, we will discuss some of the important strengths and weaknesses of digital trace data.

Strengths of Digital Trace Data

1. Always On

Strengths of Digital Trace Data (Cont.)

2. Non-Reactive

Strengths of Digital Trace Data (Cont.)

3. Captures Social relationships

Weakness of Digital Trace Data

Despite the considerable advantages of digital trace data, they also create a range of challenges for empirical observation and causal inference.

1. Inaccessible

Weakness of Digital Trace Data (Cont.)

2. Non-Representative

Weakness of Digital Trace Data (Cont.)

3. Drifting

Weakness of Digital Trace Data (Cont.)

4. Algorithmic Confounding

Sometimes, digital trace data that appear to describe human behavior actually reflect changes in the way humans interact with algorithms.

One popular example of this is the “parable of Google Flu.” Google Flu was once a popular tool that allowed users to estimate the prevalence of influenza using Google search data. The tool was so accurate that some suggested it should displace official surveys from the Centers for Disease Control (CDC).

Yet in early 2013, Google's estimates were far higher than those from the CDC. Researchers later discovered that the influenza estimates had been inflated by Google advertising links about the flu, which appeared in people's web browsers after they searched for information about symptoms of the common cold and which they then clicked on. This is sometimes referred to as “blue-team” dynamics.

Weakness of Digital Trace Data (Cont.)

5. Unstructured

Weakness of Digital Trace Data (Cont.)

6. Sensitive

Weakness of Digital Trace Data (Cont.)

7. Incomplete

Weakness of Digital Trace Data (Cont.)

8. Elite Bias

You know the famous saying, “history is written by the victors”? Much digital trace data is also created by people who are elites, and who might provide selective or incomplete accounts of what is going on, or worse.

9. Positivity-Bias

Finally, digital trace data often have performative dimensions. Many people do not report negative information about themselves online precisely because they know that their friends, colleagues – or other people they do not know – may be watching them. This creates another common form of bias in social media research.

Future of Digital Trace Data

We are Data

We are filled with data in today’s networked society

through our web activity, we are assigned gender, ethnicity, class, age, education level, and potential status of parent with x no. of children (digital trace data/digital footprint/digital breadcrumbs)
if internet metadata identifies a user as a foreigner, then they lose the right to privacy afforded to U.S. citizens
who would have thought that class status, citizenship, ethnicity could be algorithmically understood?

We are Data (Cont.)

We live in a world of ubiquitous networked communication

technologies that constitute the Internet are so woven into the fabric of our daily lives that, for most of us, existing without them seems unimaginable

We also live in a world of ubiquitous surveillance

same technologies have helped spawn an impressive network of governmental, commercial, and unaffiliated infrastructures of mass observation and control
most of what we do in this world has at least the capacity to be observed, recorded, analyzed, and stored in a databank
- HOW?
  - storage is cheap
  - computers can analyze information quickly, both in real time and retrospectively
  - our daily activities that are mediated by software can easily be configured to record and report everything they see upstream

We are Data (Cont.)

We call people ‘terrorists’ based on metadata; We kill people based on metadata

data-based attack is a ‘signature strike’
- a strike that requires no ‘target identification’ but rather an identification of groups of men who bear certain signatures or defining characteristics associated with terrorist activity but whose identities are unknown
in the early 2000s of the US drone program, strikes were “targeted”
US does not publicly differentiate between its “targeted” and “signature” strikes
- spike in the frequency of drone attacks, from 49 between 2004 and 2008 to 372 between 2009 and 2015

Algorithmic Confounding/Biasness

It occurs when a computer system reflects the implicit values of the humans who are involved in collecting, selecting, or using data

Algorithmic Confounding/Biasness (Cont.)

Algorithms might disseminate social biases against certain groups defined by sociodemographic factors (such as race, gender, geography)
The output of these algorithms is primarily dependent on the annotated datasets and is sensitive to social bias created by humans
An algorithm that uses both text and metadata to learn is likely to be highly biased as metadata consists of the author’s nationality, discipline, etc., when compared to an algorithm with text-only data
Even with text-only data, algorithms will still learn bias due to the language problems generated by second-order effects for text-based machine learning
Additionally, when using chatbots (such as ChatGPT) to provide real-time recommendations, the chatbot's dialogue can be modelled with available metadata to adjust the features of the replier in terms of gender, age, and mood (Metaphors in HCI)

Ways to Mitigate Biases

Understanding how the data was generated
Using tools that identify bias in models and algorithms such as FairML, IBM AI Fairness 360, Accenture’s “Teach and Test” Methodology, Google’s What-If Tool, and Microsoft’s Fairlearn
Making the data, process, and outcome open, thus making it transparent and helping us to judge
Creating algorithms and standards that can be adapted from one application to another
Following the set of standards proposed by the Association for Computing Machinery US Public Policy Council and applying them at every stage in the algorithm creation process
Enforcing accountability in policies during auditing in pre- and post-processing, as well as standardized assessment, as algorithms do not make mistakes, but humans do

What Constitutes Research Data?

University: “Material or information on which an argument, theory, test or hypothesis, or another research output is based” (Queensland University of Technology. Manual of Procedures and Policies. Section 2.8.3.)

Digital Project Management: “What constitutes such data will be determined by the community of interest through the process of peer review and program management. This may include, but is not limited to: data, publications, samples, physical collections, software and models” (Marieke Guy)

Government Institution: “Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues” (OMB-110, Subpart C, section 36, (d) (i))

Data Science: “The short answer is that we can’t always trust empirical measures at face value: data is always biased, measurements always contain errors, systems always have confounders, and people always make assumptions” (Angela Bassa)

Forms of Data

A small list of open multimedia formats. For a list of file formats, consider checking out the Library of Congress’ list of Sustainability of Digital Formats.

Importance of Using Open Data Formats

A screenshot of a dataset as a CSV file, which is uncompressed and follows an open standard

A screenshot of the same dataset in an Excel file (.xlsx). Unlike the previous image, this is a proprietary format
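One practical consequence of an open, plain-text format such as CSV is that it can be written and read back with nothing more than a language's standard library. The following is a minimal sketch of that idea; the rows and field names are made up for illustration and are not taken from the dataset in the screenshots.

```python
import csv
import io

# Hypothetical rows standing in for a small dataset.
rows = [
    {"title": "Sample record A", "year": "1999"},
    {"title": "Sample record B", "year": "2005"},
]

# Write the rows as CSV text (an in-memory buffer stands in for a file on disk).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "year"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Because CSV is plain text, it can be printed, inspected, and parsed back
# without any proprietary software.
print(csv_text)
read_back = list(csv.DictReader(io.StringIO(csv_text)))
```

Reading an .xlsx file, by contrast, generally requires a spreadsheet program or a dedicated library.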

Challenge: Forms of Data

As you inspect the information present in each image, consider these questions:

What are some forms of data used in the project?
What are some forms of data outputted by the project?
Where was the data retrieved from to complete the project?

Human Computers at NASA is an archival project that “seeks to shed light on the buried stories of African American women with math and science degrees who began working at NACA (now NASA) in 1943 in secret, segregated facilities.”

Listen for the Iraqis in NYC! is an audio community mapping project that seeks to locate the Iraqi population in NYC using their own voices.

This slide shows the front matter pages of two distinct digital projects. As you inspect the information present in each image, consider these questions:

What are some forms of data used in the project?
What are some forms of data outputted by the project?
Where was the data retrieved from to complete the project?

From the left side image, we can deduce that newspaper articles (digital copies of text) and photographs (digital copies of images) were used to compile this archive. Noticing the highlighted name in the news article, the data may be outputted as searchable text, searchable database, and/or searchable images. The data most likely was retrieved from a database and/or non-digital field notes. This is the data source page for Human Computers At NASA.

From the right side image, we can deduce that audio recordings of participants and a map (geospatial data) were used to compile this project. Given the details in the text on the right of the screen, we learn that the researcher will provide a map (geospatial data) and testaments (audio files) for us to peruse. The researcher has gathered digital field notes in the form of audio files from participants through a survey. The Call for Participants for Listen for the Iraqis in NYC! can be found here.

Institutional Compliance for Data and Research

The Institutional Review Board (IRB) is a floor for ethical responsibility at universities that came to pass after outrage about horrific unethical research studies done on people. A prime example of these grotesque studies is the Tuskegee Syphilis Study (1932-1972).
Born from concerns of the ethical choices made in biomedical and behavioral research, IRB compliance is not broadly applicable.
This leaves holes in institutional ethical regulations and requires researchers in other fields, such as the social sciences, to find other ethical regulations or devise field-specific ethical considerations.

When is an IRB required?

Usually, IRB review is required when ALL of the criteria below are met:

The investigator is conducting research or clinical investigation,
The proposed research or clinical investigation involves human subjects, and
The university or research institution is engaged in the research or clinical investigation involving human subjects.

Stages of Data

We begin without data. Then it is observed, or made, or imagined, or generated. After that, it goes through further transformations. Stages of data typically consist of:

Collection of "raw" data

We start with formulating a research question(s) or hypotheses and set up a project to answer our question(s).

E.g. What proportion of the artwork collected and/or hosted in the Met are by non cis-gender men artists and also in public domain?

Processing and/or transforming data

In the process of setting up the project, we make decisions on what kind of data we think can help us to answer the question.

E.g. We may retrieve the data from the Met’s open access data set. We will need to look at what variables exist in the dataset to find out if we can filter by gender, and which variables correspond to copyright status. (Note: if the file opens as a web page, you would need to use your machine’s ‘save as’ option to save it as a CSV file to view it in a tabular form.)

Cleaning

After collecting our data we then consider and make decisions in the processes of cleaning.

E.g. We have to transform some of the gender values and decide what to do with the missing fields.
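The cleaning decisions described above can be sketched in a few lines. This is only an illustration under stated assumptions: the raw values, the mapping, and the choice to label missing entries as "unknown" are all hypothetical, and the real dataset's codings should be inspected before adopting anything like it.

```python
# Hypothetical raw values for an artist-gender field; the real dataset's
# column names and codings may differ and should be checked first.
raw_values = ["Female", "female ", "", None, "F", "Male", "Non-binary"]

# One possible mapping of inconsistent spellings to a shared vocabulary.
CANONICAL = {
    "female": "female",
    "f": "female",
    "male": "male",
    "m": "male",
    "non-binary": "non-binary",
}

def clean_gender(value):
    """Normalize one raw value; missing or unrecognized entries become 'unknown'."""
    if value is None:
        return "unknown"
    key = value.strip().lower()
    return CANONICAL.get(key, "unknown")

cleaned = [clean_gender(v) for v in raw_values]
print(cleaned)
```

Keeping missing values as an explicit "unknown" category, rather than dropping those rows, is one design choice among several; it makes the amount of missing data visible in later stages.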

Analysis

We then run our preliminary analysis of the data.

E.g. We can run an analysis of the subset of non cis-gender men and public domain media objects against the total number of media objects to find out the proportion.
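As a rough sketch of this proportion, assuming the data has already been cleaned into records with a gender label and a public-domain flag (both field names and all values below are hypothetical):

```python
# Hypothetical, already-cleaned records; field names are assumptions.
records = [
    {"artist_gender": "female", "public_domain": True},
    {"artist_gender": "male", "public_domain": True},
    {"artist_gender": "non-binary", "public_domain": False},
    {"artist_gender": "unknown", "public_domain": True},
    {"artist_gender": "female", "public_domain": False},
]

def proportion_matching(records):
    """Share of all records whose artist is recorded as something other than
    'male' or 'unknown' and whose object is in the public domain."""
    if not records:
        return 0.0
    matches = [
        r for r in records
        if r["artist_gender"] not in ("male", "unknown") and r["public_domain"]
    ]
    return len(matches) / len(records)

share = proportion_matching(records)
print(f"{share:.1%} of objects match")
```

Excluding "unknown" records from the subset is itself an analytical decision, and a recorded gender label cannot by itself distinguish cisgender from transgender artists, so the result only approximates the research question.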

Visualization

At the end of our analysis, a decision is then made about how we would present the data and its analysis.

E.g. We can present the result in a pie chart or a bar graph.

Stages of Data: Non-Linear

Issues: During the Research Cycle

Ensuring good, useful data
Organizing and Structuring data for the user
Storing the data
Documenting and Describing the data
Supporting the Analysis of the data
Publishing data sets
Sharing the data and results

We as data science researchers/analysts play a specific role both DURING THE RESEARCH CYCLE and AFTER. This slide identifies the issues we encounter DURING the research cycle, such as ensuring good, useful data during the planning stage.

Understanding the goals of the research project and how the data will be analyzed and used are essential for helping the researcher organize and structure the data, as well as for choosing the appropriate storage option for the data as it is gathered by the researcher.

Documenting how the data are gathered and what the dataset includes, as well as assigning metadata to the data at the onset of the study, ensures that the data will be useful not only for the current project but also in the future, both to the researcher and to others.

Good organization of the data also ensures the data can be analyzed to produce meaningful results. The last stage of this part of the research process is helping the researcher publish the data set and sharing the data and results.

Issues: After the Research Cycle

Curating and Preserving good, useful data
Preparing the data for the user
Ingesting and Storing the data
Ensuring privacy and securing the data
Re-using the data

Different Ways to Get Data

Text Mining Platforms

This slide introduces you to some of the major text mining platforms, which are environments designed to support large-scale text analysis by combining data access, computational tools, and infrastructure in one place. These platforms are especially important when working with licensed, copyrighted, or very large text collections that cannot simply be downloaded and processed locally.

Platforms like HathiTrust Research Center (to be discontinued at the end of 2026), ProQuest TDM Studio, and Gale Digital Scholar Lab provide access to massive corpora, ranging from books and journals to newspapers and archival materials, while enforcing legal and ethical constraints around use. In most cases, the primary data never leaves the platform; instead, researchers bring the code to the data and can download the secondary, processed version of the data and visualizations.

Each platform offers a slightly different balance of features. Some emphasize built-in analytical tools, such as topic modeling, sentiment analysis, or word frequency visualizations, which allow users to explore texts without extensive programming. Others provide computational workbenches, often using Python or R notebooks, that support more advanced and customizable workflows.

It’s important to recognize that these platforms shape the kinds of questions you can ask. Limits on dataset size, available algorithms, metadata quality, and preprocessing choices all influence the research outcomes.

What is an Application Programming Interface?

The last two slides introduced you to some of the ways you can get text data from particular databases or social media platforms through different platforms or Python/R packages.

Now, let's focus on two popular ways to retrieve digital trace data.

The first is the API, or Application Programming Interface. APIs have become one of the most important ways to access and transfer data online, and increasingly APIs can even analyze your data as well. Compared to screen-scraping, which is often illegal, logistically difficult, or both, APIs let you make custom requests for data in a manner that is well structured and considerably easier to work with than HTML or XML data.

APIs are tools for building apps or other forms of software that help people access certain parts of large databases. Software developers can combine these tools in various ways or combine them with tools from other APIs in order to generate even more useful tools.

Most of us use such apps each day. For example, if you install the Spotify app within your Facebook page to share music with your friends, this app is extracting data from Spotify’s API and then posting it to your Facebook page by communicating with Facebook’s API. There are countless examples of this on the internet at present, thanks in large part to the advent of Web 2.0, the historical moment when websites became much more intertwined and interdependent.

The number of APIs that are publicly available has expanded dramatically over the past decade, as the figure on the screen shows. The website Programmable Web lists more than 20,000 APIs from sites as diverse as Google, Amazon, YouTube, the New York Times, LinkedIn, and many others. Though the core function of most APIs is to provide software developers with access to data, many APIs now analyze data as well. This might include facial recognition APIs, voice-to-text APIs, APIs that produce data visualizations, and so on.

How Does an API Work?

Broadly, every time we make an API call, we are making a custom request for data from that API. We do so by stitching together a long URL that contains information about what we want to request and who we are. We then effectively put that URL into a web browser and scrape the information that the API delivers back.

The figure on the slide shows the anatomy of a relatively simple API call for Google Maps. We have the base API URL that we are going to put at the beginning of any API request.

Next, there’s a section where we request certain information from a part of the API, that is, which types of variables we want to collect. These parts are sometimes called endpoints. One of the challenging parts of learning to work with APIs is picking up not only the jargon specific to software development but also the different jargon used by different APIs.

Next, we are going to specify the type of data format we want. For instance, in this case, we are going to ask for JSON data.

Next, we make a query. Here we are using Google’s geocode tool, which reads in a piece of text and returns everything that Google knows about it in terms of its geography.

Finally, the most important thing about an API call is that it ends with something called an API key. A key is sometimes called a credential or a token. Basically, it is a secret code. Sometimes there is more than one code that you have to stitch together into one long URL. This long code identifies you as an individual to the API that you are trying to work with, and it carries information about what data you are allowed to access and how often you are allowed to access it.
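The anatomy described above (base URL, endpoint, data format, query, key) can be sketched as a small URL-building function. The URL pattern below follows the publicly documented form of Google's Geocoding API, but the exact URL in the slide's figure may differ; the address and the key are placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://maps.googleapis.com/maps/api"  # base API URL
ENDPOINT = "geocode"                                # part of the API being queried
OUTPUT_FORMAT = "json"                              # requested data format

def build_geocode_url(address, api_key):
    """Stitch the pieces of the request into one URL."""
    # urlencode escapes spaces, commas, and other special characters.
    query = urlencode({"address": address, "key": api_key})
    return f"{BASE_URL}/{ENDPOINT}/{OUTPUT_FORMAT}?{query}"

# "YOUR_API_KEY" is a placeholder, not a working credential.
url = build_geocode_url("1600 Amphitheatre Parkway, Mountain View, CA", "YOUR_API_KEY")
print(url)
```

Building the query with `urlencode` rather than by hand avoids malformed URLs when the text being geocoded contains spaces or punctuation.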

Getting API credentials involves some type of application process. Learning how to identify the part of the API that you want to access can be challenging, because you need to read through the API documentation, which is like an instruction manual explaining how the API works.

Just like any instruction manual, some are better than others. Some contain detailed information about everything you could possibly want to know, and sometimes too much information. Others contain worked examples or vignettes that show you how to do it step by step.

Output of API Call
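When JSON is requested, the output of an API call is nested text that can be parsed into dictionaries and lists. The sketch below parses a hand-written, abbreviated sample shaped like a geocoding response; the field names follow the general shape of Google's documented responses, but the values are illustrative placeholders rather than real API output.

```python
import json

# A hand-written, abbreviated sample shaped like a geocoding response.
# The values are illustrative placeholders, not real API output.
sample_response = """
{
  "results": [
    {
      "formatted_address": "Example Street 1, Example City",
      "geometry": {"location": {"lat": 12.34, "lng": 56.78}}
    }
  ],
  "status": "OK"
}
"""

data = json.loads(sample_response)

# Check the status field before trusting the results.
if data["status"] == "OK":
    first = data["results"][0]
    location = first["geometry"]["location"]
    print(first["formatted_address"], location["lat"], location["lng"])
```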

Rate Limiting

Before you make any more calls to APIs, you need to become familiar with an important concept called “Rate Limiting.”

The API credentials not only define what type of information we are allowed to access, but also how often we are allowed to make requests for such data. These are known as “rate limits”.

If we make too many requests for data within too short a period of time, an API will temporarily block us from collecting data for a period of time that can range from 15 minutes to 24 hours or more, depending upon the API.

Rate limiting is necessary so that APIs are not overwhelmed by too many requests that occur at the same time, which would slow down access to data for everyone. Rate limiting also enables large companies, such as Google or Facebook, to prevent developers from collecting large amounts of data that could either compromise their users’ confidentiality or threaten their business model (since data has such immense value in today’s economy!).
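One common way to cope with such blocks is to wait and retry, pausing longer after each refusal. The following is a minimal sketch of that pattern, not a recipe for any particular API: the `RateLimited` exception and the stand-in request function are hypothetical (many real APIs signal a rate limit with HTTP status 429), and the appropriate delays depend on the API's documentation.

```python
import time

class RateLimited(Exception):
    """Raised by a request function when the API reports a rate limit."""

def call_with_backoff(make_request, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call make_request, waiting longer after each rate-limit signal."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...

# Usage with a stand-in request that is rate limited twice, then succeeds.
attempts = {"count": 0}

def fake_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimited()
    return {"status": "ok"}

waits = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(fake_request, sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter keeps the sketch quick to run and test; in real use the default `time.sleep` would pause the program.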

The exact timing of rate limiting is not always public, since knowing such time increments could enable developers to “game” the system and make rapid requests as soon as rate limiting has ended. Some APIs, however, allow you to make an API call or query in order to learn how many more requests you can make within a given time period before you are rate limited. Even if you do not violate rate limits, you can also be “throttled” for making too many requests overall, as the image below shows:

Challenges of Working with APIs

Screen-Scraping or Web Scraping

Is Screen-Scraping Legal?

In the early years of the internet, screen-scraping was a very common practice because there were not yet widespread legal norms surrounding the protection of data on the internet. This has changed drastically in recent decades as the value of data on websites has become obvious, and bots or automated computer programs can easily wreak havoc by collecting data from websites and repurposing it for nefarious purposes.

The very first thing you should consider before screen-scraping a website is whether you are allowed to do so. The easiest way to do this is to visit the “Terms of Service” (sometimes abbreviated as “Terms”) which often appears at the bottom of a web page.

These days, most websites have a “robots.txt” policy that specifies rules about automated data collection on the site, and an increasing number of sites do not allow such practices (especially larger websites such as Facebook, the New York Times, or Instagram). You should consult professional legal advice to determine whether you have permission to scrape a website.
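Python's standard library includes a parser for robots.txt rules, which gives a quick first check before any scraping. The sketch below parses a made-up robots.txt from a string rather than downloading one; the site, paths, and user-agent name are hypothetical. Passing this check does not settle whether scraping is permitted, since the Terms of Service and applicable law still apply.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real one is served from the root of a website.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "my-research-bot" is a hypothetical user-agent name.
events_allowed = parser.can_fetch("my-research-bot", "https://example.org/events/2024-01")
private_allowed = parser.can_fetch("my-research-bot", "https://example.org/private/data")
print(events_allowed, private_allowed)
```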

So… When Should I Use Screen-Scraping?

Screen-scraping can often be more trouble than it's worth, particularly in the age of Application Programming Interfaces. Still, there are some cases where screen-scraping remains useful. One ideal use case is where you are scraping different pages within one website again and again.

An example might include a local government website that posts information on events. Such a website may have the same “root” URL, but different suffixes that describe different days, months, etc. Assuming the structure of the pages is identical, one can then write a loop which switches out the end of the URL for different dates (using whatever naming convention is used by the site).
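The loop described above can be sketched as follows. The root URL and the year/month/day suffix convention are hypothetical; a real site's naming convention would need to be checked by hand first, and the sketch only builds the URLs without fetching them.

```python
from datetime import date, timedelta

# Hypothetical root URL and date-based suffix convention.
ROOT_URL = "https://example.gov/events/"

def daily_urls(start, end):
    """Return one URL per day from start to end (inclusive)."""
    urls = []
    current = start
    while current <= end:
        urls.append(ROOT_URL + current.strftime("%Y/%m/%d"))
        current += timedelta(days=1)
    return urls

urls = daily_urls(date(2024, 1, 30), date(2024, 2, 2))
for url in urls:
    print(url)  # each URL would be fetched and parsed in turn
```

Using `datetime` arithmetic rather than hand-built strings handles month boundaries and leap years without special cases.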

Another common reason to scrape a web page is if it is prohibitively difficult to copy and paste the information you seek from a website. Examples are extremely long pieces of text, or those that are embedded within complex tables (though again, parsing the HTML to identify the precise text you require could take a very long time as well).
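To give a sense of what parsing the HTML involves, here is a minimal sketch that pulls the text out of table cells using only the standard library. The HTML is a made-up stand-in for a downloaded page, and the parser assumes every `<td>` sits inside a `<tr>`; real pages are usually messier and may call for a dedicated scraping library.

```python
from html.parser import HTMLParser

class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])        # start a new row
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")    # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()

# Made-up HTML standing in for a downloaded page.
html = """
<table>
  <tr><td>City Council</td><td>2024-01-30</td></tr>
  <tr><td>Library Board</td><td>2024-02-02</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(html)
print(parser.rows)
```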