In this meeting, which continues work on the “WhatEvery1Says” project, members of the 4Humanities@UCSB text-preparation team will perform the following tasks:
- Share notes on and fine-tune their ongoing work in archiving and transforming documents from the corpus for a sample set stored on Google Drive.
- Discuss and split up next-stage work, which involves second-stage transformations of the texts extracted from the corpus, including:
  - Creating a stop list;
  - Identifying and transforming relevant bigrams (e.g., “social sciences”) into unigrams (“social-sciences”);
  - Using part-of-speech taggers to experiment with subtracting verbs, etc., to improve the interpretability of the topic modeling;
  - Using named-entity parsers to identify proper names, etc., that can be put in the stop list or set aside for social-network analysis separate from the topic modeling.
- Discuss initial work with the MALLET suite of tools for topic modeling.
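
As a starting point for the discussion of the stop list and the bigram-to-unigram step, here is a minimal Python sketch using only the standard library. The stop words, the bigram list, and the function names are hypothetical placeholders, not the team's actual lists or scripts; the real lists would be built from the corpus.

```python
import re

# Hypothetical examples only; the actual lists would be derived from the corpus.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "are"}
BIGRAMS = ["social sciences", "liberal arts"]


def join_bigrams(text, bigrams):
    """Replace each listed two-word phrase with a single hyphenated token."""
    for phrase in bigrams:
        first, second = phrase.split()
        pattern = re.compile(
            r"\b" + re.escape(first) + r"\s+" + re.escape(second) + r"\b",
            re.IGNORECASE,
        )
        text = pattern.sub(first + "-" + second, text)
    return text


def tokenize(text):
    """Lowercase the text and split it into tokens, keeping internal hyphens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def prepare(text, bigrams=BIGRAMS, stop_words=STOP_WORDS):
    """Join bigrams first, then tokenize and drop stop words."""
    joined = join_bigrams(text, bigrams)
    return [tok for tok in tokenize(joined) if tok not in stop_words]


if __name__ == "__main__":
    sample = "The social sciences and the liberal arts are in decline."
    print(" ".join(prepare(sample)))
```

Bigrams are joined before stop words are removed so that a phrase containing a stop word is not broken apart before it can be matched. The simple tokenizer here handles only unaccented Latin letters, which is an assumption that would need revisiting for the actual corpus.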