Linguistic Data Analysis
University of California, Merced
COGS 180: Linguistic Data Analysis with R
Summer 2020
Lecturer: Adam M. Croom, Ph.D.
Lectures: Tuesdays and Thursdays, 11:15 am – 2:45 pm, Glacier Point (GLCR) 120
Office Hours: By appointment, over Zoom or in the Social Sciences and Management Building (SSM) 250B
Course Description: Natural language processing (NLP) is an increasingly popular and important subfield of artificial intelligence concerned with using programming languages like Python and R to analyze large amounts of language data from sources such as social media sites and text corpora. NLP researchers can use their skills, for example, to identify the most common themes or words in songs highly ranked on Billboard’s The Hot 100, to gauge the overall mood or sentiment of responses to a politician’s tweet, and even to help determine whether two anonymous manuscripts were likely written by the same author. This wide variety of useful applications makes learning NLP both exciting and valuable for academic researchers; cognitive, computer, and data scientists; and many other professionals in health, business, and technology. This course therefore introduces students to NLP fundamentals along with the basic programming skills required to complete their own NLP projects. The course focuses on the R programming language and covers NLP fundamentals including importing text data, wrangling text, visualizing text, topic modeling, sentiment analysis, and working with API clients.
Outcomes: Upon completion of this course, students will have acquired competency in R programming, data analysis, and natural language processing (NLP) fundamentals. In the lab component of this course, students will complete approximately 48 hours of programming exercises covering introductory and intermediate programming with R (data wrangling, data grouping and summarizing, and data visualization) as well as natural language processing (importing text data, wrangling text, visualizing text, topic modeling, sentiment analysis, and working with API clients such as those for Twitter). Students will also complete a final knowledge synthesis report in which they discuss how to employ the skills learned in this course to complete an NLP project for their employer. Students will leave the course with the fundamental skills required to complete an NLP project of their own and the ability to explain the significance of their NLP results to others.
Reading: The readings will be available on CatCourses.
Programming: Selected programming exercises, data sets, and projects will be available on DataCamp.
Assignment Submissions: Submit your assignments by uploading them directly onto CatCourses.
Grading Procedures: Your grade for this course will be based on your performance on regular activities (80%) and a final knowledge synthesis report (20%). A grading rubric will be provided along with your assignments.
Academic Integrity: Each student must abide by the Academic Honesty Policy at the University of California, Merced. You must do all of your own work on homework assignments, projects, and exams, and copying is never allowed. Violations of academic integrity will result in disciplinary action.
Accommodations for Students with Disabilities: The University of California is committed to ensuring equal opportunities and inclusion for students with disabilities based on the principles of independent living, accessible universal design, and diversity. The University of California asks that requests for academic accommodations be made during the first three weeks of the semester, except in unusual circumstances, and students are encouraged to register with the Disability Services Center to verify their eligibility for appropriate accommodations. I am available to discuss appropriate academic accommodations that may be required for students with disabilities, so please feel free to ask if you have any questions.
Additional Remarks: This syllabus is tentative and subject to change, so stay tuned for updates. To create an optimal learning environment, computers may be used only for class exercises and notes (no gaming, social media, etc.), and no audio or video recordings are allowed in class. If you have any questions or want to talk more about the course, majoring in cognitive science, or your future career, I encourage you to visit me during office hours for a chat. I value your contributions in class and look forward to seeing you develop this semester.
Schedule Overview
- May 26 Introduction to Linguistic Data Analysis
- May 28 Introduction to R
- June 2 Intermediate R
- June 4 Introduction to the Tidyverse
- June 9 Introduction to Data Visualization
- June 11 Introduction to Importing Data in R
- June 16 Cleaning Data in R
- June 18 Introduction to Text Analysis
- June 23 Text Mining with Bag-of-Words
- June 25 Sentiment Analysis in R
- June 30 Sentiment Analysis in R the Tidy Way
- July 2 Working with Web Data in R
- July 6 Final Knowledge Synthesis Report