Please wait...
From Embeddings to LLMs: Advanced Text Analysis with Python
About
Location:
Mannheim, B6 4-5
Mannheim, B6 4-5
Course duration:
09:30-12:30 and 13:30-16:30 (CEST / UTC+2)
General Topics:
Course Level:
Format:
Software used:
Duration:
Language:
Fees:
Students: 550 €
Academics: 825 €
Commercial: 1650 €
Keywords
Additional links
Lecturer(s): Hauke Licht
Course description
Basic “bag-of-words” methods of text analysis treat words or n-grams as distinct symbols and texts as unordered collections of such symbols. This inherently limits bag-of-words methods' ability to represent many of the nuances and subtilities of natural language that make studying text so interesting for social scientists. Deep learning methods for text embedding and neural language modeling help overcome the limitations of bag-of-words text analysis approaches and thus are an essential addition to the toolkit of computational social science researchers.
This course introduces social scientists to advanced, deep learning-based text analysis methods. Participants will learn about the conceptual motivation and methodological foundations of text embedding methods and large neural language models (LLMs). Moreover, they will gather plenty of practical experience with applying these methods in social science research using the Python programming language. Next to conveying a solid conceptual understanding as well as hands-on experience with applying these methods, the course puts a strong emphasis on introducing and discussing potential social science use cases as well as ethical considerations.
- We will start by discussing text embedding methods, beginning with an overview of static word embedding models like the GloVe and word2vec algorithms and followed by contextualized embeddings and the Transformer architecture. Participants will learn how to use embeddings for document search and clustering using the scikit-learn and sentence-transformers Python packages.
- We will then cover the methodological foundations of state-of-the-art pre-trained masked language models like BERT and introduce model finetuning. Participants will learn to apply pre-trained models in supervised learning tasks (single- and multilabel sentence classification, token classification) using the transformers and setfit packages and topic modelling with BERTopic.
- Next, we will focus on generative language models like GPT and the foundations of large language models (LLMs). Participants will learn techniques to prompt LLMs to analyze and annotate texts based on no or only a few labelled examples (i.e., zero-shot prompting and few-shot in-context learning) and how to implement these techniques using ollama and llama-index in Python.
This is an advanced-level course. Participants should have prior knowledge of basic text analysis techniques. Specifically, they should have experience with standard bag-of-words pre-processing techniques and text representation approaches, such as word count-based document-feature matrices. Those looking for a more introductory-level course should consider taking “Introduction to Machine Learning for Text Analysis with Python” (15-19 September). Moreover, participants should have experience with programming in Python. The lecturer cannot introduce or repeat basics in Python programming in the course due to limited time.
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.
Organizational Structure of the Course
The course will be organized as a mixture of lectures and exercise sessions. We will switch between lectures and exercises throughout the morning and afternoon sessions of the course. In the lecture sessions, I will focus on explaining core concepts and methods. In the exercise sessions, participants will apply their newly acquired knowledge. The lecturer will be available to answer questions and provide guidance during the entire course.
Target group
You will find the course useful if:
- you have a background in the social sciences or humanities (e.g., communication science, economics, political science, sociology, or related fields)
- you have a solid understanding of basic text analysis methods and
- you want to advance your knowledge, skills, and practical experience
- you want to get up to speed with applying state-of-the-art NLP methods to text analysis problems in social science research
Learning objectives
By the end of the course, you will:
- know the methodological foundations of text embedding methods, transfer learning, Transformers, large language models (LLMs)
- be able to apply these methods to analyze social scientific text data
- be able to reflect critically on the application of the techniques in social science research, including relevant ethical considerations
Prerequisites
- Prior knowledge of basic quantitative text analysis methods
- bag-of-words text pre-processing (“tokenization”) and representation (i.e., how to represent documents with word count vectors)
- (conceptual) knowledge of dictionary analysis, topic modeling, and supervised text classification methods is strongly recommended
- Basic knowledge of Python
- creating and manipulating strings, lists and dictionaries
- creating and interacting with objects, classes and methods
- reading and manipulating data frames with pandas
- using loops
- defining new functions
- Basic knowledge of quantitative research methods
- understanding of linear and logistic regression analysis
- a basic understanding of matrix algebra might be helpful but is not required
For those who would like a primer or refresher in Python, we recommend taking the online workshop “Introduction to Python” (25-28 August) and/or the online blended learning course “Introduction to Computational Social Science with Python” (01-05 September).
Software and Hardware Requirements
- You should bring your own laptops to this course.
- You should have Python (≥ 3.11), miniconda, pip, and Jupyter Notebook installed on your laptop.
- Required Python libraries
- text processing: nltk, genism, transformers, setfit, sentence-transformers, BERTopic, llama-index, ollama, llama-index-llms-ollama
- others: numpy, scipy, pandas, scikit-learn
- The lecturer will distribute concrete instructions for the Python setup and a comprehensive list of required libraries before the course and assist with any remaining setup problems on the first day of the course.
- It is recommended that you create a Google Colab account, especially if your laptop has no Nvidia GPU (Windows/Linux) or Apple Silicon chip (MacOS).