Web Data Collection with Python
About
Location:
Online via Zoom
Course duration:
14:00-17:30 (CEST / UTC+2)
Fees:
Students: 350 €
Academics: 500 €
Commercial: 1000 €
Lecturer(s): Iulia Cioroianu
Course description
The exponential increase in online and social media data offers unprecedented opportunities for advancing research across a variety of fields, both within academia and outside of it. For instance, diverse data such as election results, press releases, or social media posts can inform research questions in the social sciences. Although the availability of data online is steadily increasing, extracting these data is not always straightforward, especially since many popular social media sites have shut down or restricted access to their Application Programming Interfaces (APIs). Furthermore, the heterogeneity of data almost always requires reshaping these data before they can be used effectively for analysis, which can also be challenging. This course provides researchers with the tools needed to collect and pre-process large-scale data from a range of online sources.
Through a combination of lectures, hands-on tutorials and individual/group exercises, participants will develop a theoretical understanding of the challenges associated with online data collection and the best methods and tools for addressing them in Python, as well as the practical skills needed to scrape data from both static and dynamic websites and collect data through APIs. The sources used in the examples provided include social media websites, online media outlets and news aggregators, government data portals, and other large-scale online data repositories.
The most difficult part of a computational project that collects complex and heterogeneous data is often the pre-processing needed to prepare the data for subsequent analysis and to link it across a variety of sources. The course therefore also covers text-based methods for data cleaning and pre-processing. By the end of the week, participants should be able to apply the methods studied to extract and process data for their own research projects.
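As a rough illustration of the kind of text-based cleaning meant here, the sketch below normalises a couple of short text snippets using only Python's built-in re and unicodedata modules. The sample posts, the patterns and the clean_text helper are assumptions made for this example and are not taken from the course materials.

```python
import re
import unicodedata

# Illustrative patterns (assumptions for this sketch, not course material).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Normalise one scraped text snippet for later analysis."""
    # Unify Unicode variants, e.g. turn non-breaking spaces into plain spaces.
    text = unicodedata.normalize("NFKC", raw)
    # Drop URLs and @-mentions, which rarely carry meaning for text analysis.
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = WHITESPACE_RE.sub(" ", text)
    return text.strip().lower()


# Hypothetical examples of messy scraped text.
posts = [
    "Results are in!\u00a0 See https://example.org/results  @newsdesk",
    "  Turnout:\n 61%  ",
]
cleaned = [clean_text(p) for p in posts]
print(cleaned)
```

Which elements to strip (URLs, mentions, case) depends on the research question, so these steps are best treated as a starting point rather than a fixed recipe.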
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.
Organizational Structure of the Course
The course will be offered online and taught in Python in the afternoon. A parallel course taught in R takes place in the morning (09:30-13:00); participants interested in taking both courses should contact the Fall Seminar team at fallseminar@gesis.org for a discounted rate. The content and examples used in the lecturer-led tutorials are similar across the two programming languages, so participants who want to build skills in a language they are less proficient in can do so by drawing parallels between the two courses.
We will start the daily sessions with a lecture laying out the main notions and providing an overview of the language-specific tools used (approximately 45 minutes), followed by a hands-on lecturer-led tutorial (45 minutes). The second part of the session will consist of several exercises that students are encouraged to solve in small groups or individually (90 minutes). Each exercise will be followed by an instructor-led discussion of the solutions. In the final part of the session, students will complete a short exercise as an individual assignment (30 minutes). The daily schedule is presented in the table below.
Theoretical overview and lecturer-led tutorial | 14:00-15:30 |
Individual or small-group exercises and solutions | 15:45-16:00, 16:10-16:30 |
Individual assignment | 16:30-17:00 |
The lecturer will provide continuous support during the exercise sessions, and will be available for individual consultations on participants' projects during the last day.
Target group
You will find the course useful if:
- You want to learn how to collect and process large amounts of data from online sources quickly.
- You aim to improve your existing web scraping skills or have so far encountered difficulties trying to scrape data from online sources.
- You have a research idea for which online data might be suitable, but you are not sure of the practical implications.
Learning objectives
By the end of the course, you will:
- Understand the structure and basic features of different forms of online data.
- Be able to collect data from static and dynamic websites.
- Be able to interact with APIs to access and collect data.
- Be able to parse, clean and process the data collected.
- Be able to apply the methods studied to your own research projects.
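To give a flavour of what collecting data from a static page involves, the sketch below extracts link texts and absolute URLs from a small HTML snippet. It deliberately uses Python's built-in html.parser module so that it runs without extra packages; this is a stand-in for the course's tools, where a library such as BeautifulSoup or lxml would more likely be used. The HTML, the example.org URLs and the LinkCollector class are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (link text, absolute URL) pairs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical static page; in practice the HTML would first be downloaded.
SAMPLE_HTML = """
<html><body>
  <h1>Press releases</h1>
  <ul>
    <li><a href="/press/2024-01.html">Budget statement</a></li>
    <li><a href="https://example.org/press/2024-02.html">Election results</a></li>
    <li><a>No link target</a></li>
  </ul>
</body></html>
"""

parser = LinkCollector(base_url="https://example.org/")
parser.feed(SAMPLE_HTML)
links = parser.links
for text, url in links:
    print(text, "->", url)
```

Dynamic websites that build their content with JavaScript generally cannot be handled this way, because the links are not present in the downloaded HTML; that is the case in which a browser automation tool such as Selenium comes in.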
Prerequisites
- Working knowledge of Python, including data structures and control structures.
- Participants attending both the R and the Python course should have working knowledge of each of the two programming languages.
- If you lack basic knowledge of these programming languages, you are encouraged to take the Introduction to Computational Social Science with R or Introduction to Computational Social Science with Python course in week 1 and/or the introductory online workshops (Intro to R, Intro to Python).
Software and Hardware Requirements
Participants should pre-install the following software and packages:
- Python 3
- required packages (final list of packages to be provided before the course): requests, lxml, BeautifulSoup, Selenium, pandas, NLTK (the re module is part of the Python standard library and needs no separate installation)
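One possible way to check an installation before the course starts is sketched below. It reports which of the packages named above can be imported without actually importing them. The mapping from package names to import names (for example, BeautifulSoup is imported as bs4) and the check_environment helper are assumptions for this example; the final package list provided before the course may differ.

```python
import sys
from importlib.util import find_spec

# Assumed mapping from the package names listed above to importable module names.
REQUIRED = {
    "requests": "requests",
    "lxml": "lxml",
    "BeautifulSoup": "bs4",
    "Selenium": "selenium",
    "pandas": "pandas",
    "NLTK": "nltk",
    "re": "re",  # standard library, always available
}


def check_environment(required=REQUIRED):
    """Return {package label: True/False} depending on whether it is importable."""
    return {
        label: find_spec(module) is not None
        for label, module in required.items()
    }


status = check_environment()
print("Python", sys.version.split()[0])
for label, available in status.items():
    print(f"{label:15s} {'ok' if available else 'MISSING'}")
```

Anything reported as MISSING can usually be installed with pip or conda; note that the installable name may differ from the import name (BeautifulSoup is distributed as beautifulsoup4).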