Comp 790-133: Language and Vision

Instructor: Tamara Berg  (tlberg -at-
Office: FB 236
Lectures: Tues 2:00-4:30pm, Rm SN-011
Office Hours: Tues 1:00-2:00pm and by appointment
Course Webpage:


  • Welcome to Language and Vision!
  • 1/13/15 Look over the topic list and email me your top 3 discussion choices by Thursday (1/15/15).
  • 1/27/15 HW1 is online here, due Feb 16.
  • 2/17/15 Classes are cancelled due to weather. Schedule updated accordingly. We will hold a make-up class, tentatively Friday Feb 27 4pm.
  • 2/17/15 HW2 is online here, due March 6.
  • 3/2/15 If you have a project idea or are looking for a collaborator, post something here.


This course will explore topics straddling the boundary between Natural Language Processing and Computer Vision. Now that basic visual recognition algorithms are beginning to work, we can think about predicting the higher-level interpretations of images necessary for general image understanding. These interpretations can be aided by text associated with images/videos and by knowledge about the world learned from language. On the NLP side, images can help ground language in the physical world, allowing us to develop models for semantics. Language and Vision is a natural place to explore these questions, as words and pictures are often naturally linked online and in the real world, and each modality can provide reinforcing information to aid the other. In this course, we will learn how to make use of the complementary nature of words and pictures through topic lectures and discussions about state-of-the-art research. Students will be responsible for completing 2 HW assignments, reading research papers, and participating in discussions. They will also have a chance to explore a topic of their choice in more depth through a class project.

Topics (a subset of these will be covered based on time)
  • Basic Background in NLP, Vision, and Machine Learning
  • Features and Representations
  • Image Retrieval
  • Multi-modal Clustering and Word Sense Disambiguation
  • Text as weak labels for image or video classification
  • Image/Video Annotation and Natural Language Description Generation
  • Auto-illustration
  • Natural Language Grounding & Learning by Watching
  • Learning Knowledge from the web
  • Deep Learning for Images, Text, and multi-modal data


No prior experience in computer vision or natural language processing is required to take this course, although some knowledge of these areas or of machine learning will be useful. Students are allowed 3 free late days for assignments over the semester. Afterward, late assignments will be accepted with a 10% reduction in value per day late.
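The late policy can be read as simple arithmetic. The sketch below is one interpretation, not an official rule: it assumes the 3 free late days form a single pool shared across all assignments, that each penalized day removes 10% of the assignment's full value (no compounding), and that credit never drops below zero. The function name `late_credit_percent` and its parameters are hypothetical.

```python
def late_credit_percent(days_late, free_days_left=3, penalty_per_day=10):
    """Return (percent of credit kept, free late days remaining).

    Assumptions (not stated in the syllabus): free days are pooled over the
    semester, the penalty is linear in penalized days, and credit floors at 0.
    """
    if days_late < 0 or free_days_left < 0:
        raise ValueError("days_late and free_days_left must be non-negative")

    # Spend free late days first; only the remainder is penalized.
    free_used = min(days_late, free_days_left)
    penalized_days = days_late - free_used

    # Integer percentages avoid floating-point rounding surprises.
    credit = max(0, 100 - penalty_per_day * penalized_days)
    return credit, free_days_left - free_used


if __name__ == "__main__":
    credit, remaining = late_credit_percent(5)
    print(f"credit kept: {credit}%, free late days left: {remaining}")
```

Under these assumptions, an assignment turned in 5 days late with all 3 free days still available would keep 80% of its value and use up the free days.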


There will be 2 homeworks assigned during the first two months of the course to get students acquainted with topics in Language and Vision. Over the final month of the course, students will develop their own project related to language and vision. This will include a proposal presentation, a written update, and a final presentation and written document. Projects should involve some amount of text and image processing, but the exact topic and amount of language or vision involved can be determined by the student in consultation with the instructor. Students will also be responsible for reading assigned research papers, submitting short paper summaries, and participating in class discussions. Paper summaries should be submitted in hard copy at the start of each class when there are papers assigned. During one class, students will be in charge of facilitating discussion of an application related to one of the research topics.

Assignments and projects may be completed in pairs, with one submission per pair. Assignments may be discussed with anyone in the class, but each pair of students should implement their own assignments. Code from the internet is allowed but must be cited, and the extent of internet code incorporated must be indicated in comments within the code as well as in the write-up.

Grading will consist of: Assignments (30%), Project (40%), Participation (30%).
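As a worked illustration of this breakdown, the sketch below combines three component scores using the stated weights. It assumes each component is scored on a 0-100 scale; the names `WEIGHTS` and `final_score` are hypothetical, and the example scores are made up.

```python
# Weights in percent, taken from the grading breakdown (they sum to 100).
WEIGHTS = {"assignments": 30, "project": 40, "participation": 30}


def final_score(assignments, project, participation):
    """Weighted average of the three components, each assumed to be 0-100."""
    scores = {
        "assignments": assignments,
        "project": project,
        "participation": participation,
    }
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    # (30*90 + 40*80 + 30*100) / 100 = 89.0
    print(final_score(assignments=90, project=80, participation=100))
```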

Tentative Schedule

Date | Topic | Readings | Discussion Leads | Assignments
Jan 13 | Intro to the Course (slides1), Overview of Computer Vision (slides2) | - | - | -
Jan 20 | Overview of NLP (slides1), Features & Representations (slides2) | - | - | -
Jan 27 | Image Retrieval (slides, discussion slides) | "PageRank for Product Image Search", "Animals on the Web" | Brian, Andrew | HW1 out, paper summaries
Feb 3 | Overview of Machine Learning Techniques (slides) | - | - | -
Feb 10 | Clustering (slides) | "Computing Iconic Summaries of General Visual Concepts", "Who's in the Picture?" | Chen-Yang, Alexis | paper summaries
Feb 17 | Classes canceled due to weather | - | - | HW2 out
Feb 24 | Classification - Text as weak labels (slides) | "Building text features for object image classification", "Learning realistic human actions from movies" | Chris, Chun-Wei | paper summaries
Feb 27, 4pm | Make-up class - overview of available project resources + brainstorming (slides1, slides2) | - | - | -
March 3 | Attributes (slides) | "Automatic Attribute Discovery and Characterization from Noisy Web Data", "Relative Attributes" | Natalie, Yipin | paper summaries
March 10 | Spring Break | - | - | -
March 17 | Project Proposals | - | - | 10 minute project proposal presentation, 2 page write-up
March 24 | Description Generation (slides) | "Baby Talk: Understanding and Generating Simple Image Descriptions", "Collective Generation of Natural Image Descriptions", "Generating Natural-Language Video Descriptions Using Text-Mined Knowledge" | Kyle S, Liang, Licheng | paper summaries
March 31 | Auto-Illustration (slides) | "WordsEye: An Automatic Text-to-Scene Conversion System", "Learning Spatial Knowledge for Text to 3D Scene Generation" | Rob, Matthew | paper summaries
April 7 | Learning Knowledge from Data | "NEIL: Extracting Visual Knowledge from Web Data", "Bringing Semantics Into Focus Using Visual Abstraction" | Carl, Hasan | 4 page project progress report, paper summaries
April 14 | Deep Learning | "Deep Visual-Semantic Alignments for Generating Image Descriptions" | Eunbyung, Kyle M | paper summaries
April 21 | Final Project Presentations | - | - | 10 minute final project presentation
May 4 | - | - | - | Final project write-up due (8 pages, conference paper layout)

Reference Books
1) Forsyth, David A., and Ponce, J. Computer Vision: A Modern Approach, Prentice Hall, 2003.
2) Hartley, R. and Zisserman, A. Multiple View Geometry in Computer Vision, Cambridge University Press, 2002.
3) Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2008.
4) Christopher D. Manning and Hinrich Schuetze. Foundations of Statistical Natural Language Processing.