From Sentences to Insights: Massive Data Institute’s Text as Data Workshop Series

The Massive Data Institute, a program in the McCourt School of Public Policy that researches data science methodologies to improve public policy, kicked off its “Text as Data” workshop series with a two-part event on the challenges of using text data on Sept. 18 and 19.

The series, organized by Massive Data Institute (MDI) Director Lisa Singh and Associate Director Anjelika Deogirikar Grossman, provides insight into the different applications of text data analysis, which uses coding and software to analyze digital texts and extract specific data and insights.

Le Bao, a postdoctoral fellow at the MDI, was the featured speaker for the first of three workshops in the series, discussing measurement and inference issues with text data.

“We want to show everyone they can get something different from the workshop,” Bao told The Hoya. “Even if they have some background, we will give them some new perspectives and look at the existing issues or questions, and for people without much background, we also want to show them how to develop those skills.”

The MDI draws on Georgetown University’s expertise in computer science, data science, public health, public policy, and social science to shape policy with the goal of improving people’s lives through data-centric research.

“We provide experiential learning opportunities with faculty and external partners to help train the next generation of scholars and practitioners, exposing them to innovative methods for conducting interdisciplinary, data-centric research that facilitates policy making,” according to the MDI’s mission statement.

During the events, public policy and social science were a focal point when considering the application of text data. For many years, even though researchers had access to text data, they hardly touched it because it is difficult to analyze.

“It’s kind of unstructured and hard to study, especially compared to survey data,” Bao said. “Every unique word or term can represent a different dimension in a data. That’s kind of been the biggest challenge in statistical analysis.”
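The dimensionality Bao describes can be seen in a minimal sketch of a bag-of-words count table, in which every unique word becomes its own column. The two sentences and the function name `document_term_matrix` are made up for illustration and do not come from the workshop materials.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, rows); each row counts the vocabulary terms in one document."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # Every unique word across all documents becomes one dimension (column).
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical example sentences, not real speech data.
docs = [
    "the senator supports the bill",
    "the senator opposes the amendment",
]
vocabulary, rows = document_term_matrix(docs)
print(len(vocabulary), "dimensions:", vocabulary)
for row in rows:
    print(row)
```

Two short sentences already produce six dimensions. A collection of full-length speeches would likely produce thousands, most of them zero for any single document.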

Edward Chen (MSB ’24) attended the first workshop and said it is important to study text data because there are valuable insights from data analysis that can be applied to research and public policy issues.

“Understanding how to extract meaningful information from unstructured text data is increasingly valuable in various professional fields,” Chen told The Hoya.

Twitter/ @MassiveData_GU | Massive Data Institute hosts the first session of the Text as Data workshop.

Chen said he appreciated how Bao showed attendees firsthand the value of text analysis through hands-on experience with analyzing political speeches.

“This hands-on approach facilitated a clearer understanding of the concepts and equipped attendees with practical skills that could be put into immediate use,” Chen said.

Singh said another main focus of the first workshop was explaining the challenges of text analysis. During his lecture, Bao emphasized the importance of customizing text analysis to the type of text being analyzed.

“You need to be super careful about the measurement and methodological decisions you make along the way of analyzing text because it’s unstructured and super high dimensional,” Bao said. “You sometimes will miss things if you don’t pay attention to the substantive meanings of the text.”

Bao used the example of analyzing speeches about abortion delivered by members of the U.S. Congress to illustrate this challenge. Someone following a formulaic approach to text analysis may discard pronouns, but in speeches about abortion, pronouns like “she” and “her” can be very valuable for understanding the sentiment of the message.
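The pronoun problem can be illustrated with a rough sketch of stopword removal, a common preprocessing step that drops frequent words before analysis. The stopword list, the sentence and the function names below are hypothetical and chosen only for illustration. They are not taken from Bao's lecture or from any particular library.

```python
# Hypothetical, deliberately short stopword list; real lists are much longer.
GENERIC_STOPWORDS = {
    "the", "a", "to", "and", "of", "is", "about", "he", "his", "she", "her",
}
# Pronouns an analyst might decide to keep for this particular topic.
PRONOUNS_TO_KEEP = {"she", "her"}


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    return [word.strip(".,;:!?\"'").lower() for word in text.split()]


def remove_stopwords(tokens, stopwords):
    """Keep only the non-empty tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok and tok not in stopwords]


# Made-up sentence, not a quotation from any real speech.
sentence = "The decision is about her health and she should make it."
tokens = tokenize(sentence)

formulaic = remove_stopwords(tokens, GENERIC_STOPWORDS)
customized = remove_stopwords(tokens, GENERIC_STOPWORDS - PRONOUNS_TO_KEEP)

print("formulaic: ", formulaic)
print("customized:", customized)
```

The formulaic pass silently removes “she” and “her,” while the customized pass keeps them. Which words to keep is a substantive judgment about the text, not a default setting.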

It is cases such as these that Bao pointed to when expressing the importance of the “Text as Data” workshop series, which allows attendees to train their own large language models. Bao explained that using large language models like ChatGPT doesn’t give people a chance to understand text analysis.

“Training a small model yourself will give you a lot of time to think about methodological issues within large language models,” Bao said.

MDI will host the next session, titled “Advanced Models Using Text,” on Oct. 23 and 24, and the final session, “Cutting Large Language Models Down to Size,” on Nov. 13 and 14.

Bao recommended that students register for these events through the MDI website if they want to understand text data and large language models, even if they did not attend the first workshop.

The Hoya
