4Humanities@UCSB Meeting — “WhatEvery1Says” Project (Text Preparation Team) (April 30, 2014)

In this meeting, which continues work on the “WhatEvery1Says” project, members of the 4Humanities@UCSB text-preparation team will perform the following tasks:

  • Share notes on and fine-tune their ongoing work in archiving and transforming documents from the corpus for a sample set stored on Google Drive.
  • Discuss and divide up the next stage of work, which involves second-stage transformations of the texts extracted from the corpus, including:
    • Create a stop list;
    • Identify and transform relevant bigrams (e.g., “social sciences”) into unigrams (“social-sciences”);
    • Use part-of-speech taggers to experiment with removing verbs, etc., to improve the interpretability of topic modeling;
    • Use named-entity parsers to identify proper names, etc., that can be put in the stop list or set aside for social-network analysis separate from the topic modeling.
  • Discuss initial work with the MALLET suite of tools for topic modeling.
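
As a rough illustration of two of the second-stage transformations listed above (applying a stop list and joining bigrams into hyphenated unigrams), here is a minimal Python sketch. The stop words, the bigram list, and the function names are hypothetical placeholders chosen for this example; they are not the team's actual lists or scripts.

```python
import re

# Hypothetical sample lists for illustration only; the project's real
# stop list and bigram list are not specified in this post.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "are"}
BIGRAMS = {("social", "sciences"), ("liberal", "arts")}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


def merge_bigrams(tokens, bigrams):
    """Join each listed pair of adjacent tokens into one hyphenated token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "-" + tokens[i + 1])
            i += 2  # skip both halves of the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]


def prepare(text, stop_words=STOP_WORDS, bigrams=BIGRAMS):
    """Tokenize, merge bigrams, then filter stop words, in that order."""
    return remove_stop_words(merge_bigrams(tokenize(text), bigrams), stop_words)


if __name__ == "__main__":
    sample = "The social sciences and the liberal arts are in crisis"
    print(" ".join(prepare(sample)))
```

In this sketch the bigrams are merged before the stop list is applied, so that removing a stop word cannot make two previously separated words look adjacent. Whether that ordering suits the project's corpus is something the team would need to test.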
