4Humanities@UCSB Meeting – Topic Modeling “WhatEvery1Says” (Nov. 21, 2013)
4Humanities@UCSB will hold its second meeting of the 2013-14 academic year on Thursday, November 21 (noon – 1:30, South Hall 2509). Following up on its previous meeting, which focused on the “Heart of the Matter” report produced by the American Academy of Arts & Sciences’ Commission on the Humanities and Social Sciences, this meeting will be devoted specifically to designing a technical and interpretive strategy for digital-humanities analyses of the WhatEvery1Says corpus of statements about the humanities that 4Humanities has been gathering.
In particular, we will plan how to “topic model” the WhatEvery1Says corpus and how to implement other methods of text analysis (and visual analysis) designed to discover new facets of the U.S. and worldwide discussion of the humanities. Initial experiments with “The Heart of the Matter” report and a small subset of other documents include:
If you are interested in learning about such methods of analysis as they might help support humanities advocacy, please come to the meeting.
Topic Modeling the 4Humanities WhatEvery1Says Corpus: The Idea
Strategy (1) – Work Plan (Personnel, Time Line, etc.)
- Possible plan: start project at UCSB, then ask for critique of methods and further help from international DH community.
- Who at UCSB might want to participate?
- Involvement of students from the “Writing and Civic Engagement” minor in UCSB’s Professional Writing program (serving as interns)?
- Time line? Possible syncopated rhythm of 4Humanities@UCSB activities during the rest of this year:
- Discussion meetings (e.g., on “Global Humanities,” “Humanities / Sciences”)
- Research workshop meetings (topic modeling, etc.)
Strategy (2) – Methodological and Technical Workflow (known issues & draft plan)
- Collection methodology for WhatEvery1Says corpus
- Critically examine the selection criteria. Currently, the criteria for inclusion of resources are:
- Online material.
- Textual documents (not audio or video resources).
- Documents in English.
- Documents more-or-less in the public milieu (including journalism, blogs, reports, white papers, etc., but not scholarly studies or research).
- Documents sized between posts/articles and reports (not books).
- Known needs:
- Most documents in the corpus are from the last two years. We need documents from past periods.
- Most documents in the corpus are from (or address) the North American context (with some representation of the U.K. and other nations/regions). We need more documents from other parts of the world.
- Collection format for the corpus (currently a Google spreadsheet). Should the holding format be (for example) a database, a Zotero collection, etc., to make it easier to filter, group, and extract metadata about the documents (e.g., citations)? (Related issue: collection platform for processed files.)
- Text extraction from HTML and PDF files. We need to research tools such as pdf2htmlEX to semi-automate the extraction of text from documents into “plain text.” (Related issue: exclusion of non-relevant material in documents such as advertisements, bios of authors, copyright notices, etc.)
- Text cleaning and preparation: Python scripts or other tools for fixing common errors, standardizing spellings, resolving hyphenations, etc. (Cf. Ted Underwood and Andrew Goldstone’s scripts and other resources for topic modeling).
- Interpretive text preparation:
- Consolidating semantically unitary bigrams into unigrams (e.g., “social sciences” into “social_sciences”).
- Filtering out proper names (experimenting with named-entity recognizers).
- Creating a “stop list.” (See Ted Underwood and Andrew Goldstone’s stop list for their project on topic modeling literary studies journals.)
- Experimenting with part-of-speech (POS) taggers to filter out everything but nouns (cf. Matthew Jockers’s topic-modeling work).
- “Chunking” (breaking documents if needed into appropriately sized subdocuments).
- Topic Modeling:
- Experimenting to see if we should use the full-featured, command-line Mallet topic modeling suite or the simpler graphical tool built on it (Topic Modeling Tool).
- Experimenting with different parameters for the topic modeling, most importantly: number of topics to ask the algorithm to produce.
- Interpretive labeling of topics (and visualization of topics).
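Several of the preparation steps listed above (resolving hyphenations, consolidating bigrams, applying a stop list, and chunking) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the project's actual scripts: the `BIGRAMS` and `STOP_WORDS` lists and all function names are hypothetical placeholders, and the real lists would have to be built from the corpus itself.

```python
import re

# Hypothetical example lists (assumptions for illustration only).
BIGRAMS = {
    ("social", "sciences"): "social_sciences",
    ("liberal", "arts"): "liberal_arts",
}
STOP_WORDS = {"the", "and", "of", "in", "a", "to", "are", "is"}


def resolve_hyphenation(text):
    """Rejoin words split across a line break ('human-' + 'ities').

    Limitation: a genuine hyphenated compound that happens to break at
    a line ending also loses its hyphen.
    """
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def merge_bigrams(tokens, bigrams=BIGRAMS):
    """Replace each listed two-word phrase with a single joined token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in bigrams:
            merged.append(bigrams[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]


def chunk(tokens, size):
    """Split a token list into consecutive subdocuments of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def prepare(text, chunk_size=1000):
    """Run the steps in order: de-hyphenate, tokenize, merge, filter, chunk."""
    tokens = tokenize(resolve_hyphenation(text))
    tokens = remove_stop_words(merge_bigrams(tokens))
    return chunk(tokens, chunk_size)


if __name__ == "__main__":
    sample = "The human-\nities and the social sciences are in crisis."
    print(prepare(sample, chunk_size=2))
```

Bigrams are merged before the stop list is applied so that a phrase containing a stop word can still be consolidated. Named-entity filtering and POS tagging are left out of the sketch because they depend on external tools that would still need to be evaluated.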
Strategy (3) – Outcomes
- Creation of an interactive site for exploring the topic model of WhatEvery1Says. (Cf. DFR-Browser, a browser-based visualization interface created by Andrew Goldstone for exploring his topic model of JSTOR articles.)
- Co-authored research report or article on outcomes.
- Workshop to brainstorm ways we can apply the outcomes in facilitating, guiding, or creating advocacy arguments and materials.