Teaching plan for the course unit

General information

 

Course unit name: Natural Language Processing

Course unit code: 572673

Academic year: 2021-2022

Coordinator: Daniel Ortiz Martinez

Department: Faculty of Mathematics and Computer Science

Credits: 3

Single program: S

 

 

Estimated learning time

Total number of hours: 75

Face-to-face and/or online activities: 30
  - Lecture (face-to-face and online): 15
  - Lecture with practical component (face-to-face): 15
Supervised project: 15
Independent learning: 30

Recommendations

 

Learn to use numpy proficiently before starting the course (a short example of the expected level follows).
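
A minimal, illustrative example of the level of numpy fluency assumed (vectorized arithmetic and boolean indexing; the numbers are arbitrary):

    import numpy as np

    x = np.arange(10, dtype=float)   # [0., 1., ..., 9.]
    y = x ** 2 - 3.0 * x             # elementwise arithmetic, no Python loops
    mask = y > 0                     # boolean mask selecting positive entries
    print(y[mask].mean())            # mean of the positive values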

 

 

Competences to be gained during study

 

To know how to gather and extract information from structured and unstructured data sources.

 

To know how to clean and transform data with the goal of creating valuable, manageable, and informative data sets.

 

To be able to use storage and processing technologies for handling large data sets.

 

To apply analytic and predictive machine learning techniques efficiently and effectively.

 

To communicate results using appropriate communication skills and visualization tools and techniques.

 

 

 

 

Learning objectives

 

Referring to knowledge

Getting to know the basic concepts in the area of natural language processing.

 

Discovering and using the basic natural language processing methods.

 

 

Teaching blocks

 

1. Overview of the course

2. Representing text for machines: working with strings, parsing text documents, and creating a corpus

2.1. How to format strings

2.2. Relevant functions for strings

2.3. Parsing text documents and creating a corpus (sketch below)
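
A rough sketch of block 2's topics, assuming a hypothetical folder "corpus/" of plain-text files (the folder name and its contents are made up):

    from pathlib import Path

    name, score = "NLP", 9.5
    print(f"Course: {name} | score: {score:.1f}")   # f-string formatting

    # Parse a directory of .txt documents into a simple in-memory corpus.
    corpus = []
    for path in Path("corpus").glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        corpus.append(text.lower().strip())         # common string functions
    print(len(corpus), "documents loaded")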

3. Regular expressions

3.1. How to define regular expressions for extracting relevant information

3.2. Regular expressions for tokenizing sentences (sketch below)
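
An illustrative sketch of both uses of regular expressions named above; the patterns are simplified examples, not the course's definitions:

    import re

    text = "Contact us at info@ub.edu or call 934 021 100."
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)  # extract e-mail addresses
    tokens = re.findall(r"\w+|[^\w\s]", text)              # words or punctuation marks
    print(emails)       # ['info@ub.edu']
    print(tokens[:5])   # ['Contact', 'us', 'at', 'info', '@']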

4. Language models and edit distance

4.1. How can we define metrics over strings?

4.2. Detecting words similar to a misspelled word (edit-distance sketch below)

4.3. Fast retrieval of similar words: the BK-tree data structure; using Cython to speed up distance computations

4.4. NLTK
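
An illustrative implementation of the Levenshtein edit distance, the kind of string metric this block discusses (a sketch, not the course code; NLTK ships an equivalent nltk.edit_distance):

    def edit_distance(a: str, b: str) -> int:
        # prev[j] = distance between the first i-1 chars of a and first j chars of b
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    print(edit_distance("kitten", "sitting"))   # 3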

5. Document representations for classification

5.1. scipy.sparse matrices and efficient ways to operate on them

5.2. Using machine learning models such as logistic regression, the perceptron, support vector machines, and the multilayer perceptron

5.3. Using pipelines to chain preprocessing and machine learning steps (sketch below)
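
A minimal sketch tying block 5 together (scikit-learn assumed): a sparse bag-of-words representation (5.1) feeding a linear classifier (5.2) inside a pipeline (5.3). The documents and labels are toy data:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    docs = ["good movie", "bad movie", "great film", "awful film"]
    labels = [1, 0, 1, 0]

    clf = Pipeline([
        ("vect", CountVectorizer()),      # produces a scipy.sparse matrix
        ("model", LogisticRegression()),
    ])
    clf.fit(docs, labels)
    print(clf.predict(["good film"]))     # expected: [1]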

6. Document representations for retrieval tasks

6.1. How can we represent documents in a way that is suitable for defining a metric between them?

6.2. Tf-idf feature vector

6.3. Techniques and data structures for fast retrieval of similar documents (tf-idf sketch below)
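
A sketch of 6.1-6.2: tf-idf document vectors with cosine similarity as the metric between documents (toy documents, scikit-learn assumed):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs are animals"]
    X = TfidfVectorizer().fit_transform(docs)   # sparse tf-idf matrix
    sims = cosine_similarity(X[0], X)           # similarity of doc 0 to every doc
    print(sims.round(2))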

7. Clustering documents

7.1. Understanding how to retrieve documents similar to a query document using a model

7.2. Grouping related documents using different approaches: k-means and Gaussian mixture models (k-means sketch below)
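
A sketch of 7.2 using one of the two approaches named, k-means over tf-idf vectors (the documents and the number of clusters are arbitrary; scikit-learn assumed):

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["cheap flights to Rome", "hotel deals in Rome",
            "python list comprehension", "python string methods"]
    X = TfidfVectorizer().fit_transform(docs)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)   # documents sharing a label were grouped together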

8. Sequence models for NLP tasks: named entity recognition and part-of-speech tagging

8.1. Understanding how to find the most likely hidden state sequence according to some scores, and using it to anonymize documents (Viterbi sketch below)

8.2. Implementing a Hidden Markov model

8.3. Implementing a structured perceptron model
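
An illustrative Viterbi decoder for 8.1, finding the most likely hidden state sequence from per-position scores; the probabilities below are invented:

    import numpy as np

    def viterbi(emission, transition):
        """emission: (T, S) log-scores; transition: (S, S) log-scores."""
        T, S = emission.shape
        score = np.zeros((T, S))
        back = np.zeros((T, S), dtype=int)
        score[0] = emission[0]
        for t in range(1, T):
            cand = score[t - 1][:, None] + transition  # (prev state, next state)
            back[t] = cand.argmax(axis=0)
            score[t] = cand.max(axis=0) + emission[t]
        path = [int(score[-1].argmax())]               # best final state
        for t in range(T - 1, 0, -1):                  # follow backpointers
            path.append(int(back[t][path[-1]]))
        return path[::-1]

    emission = np.log([[0.7, 0.3], [0.4, 0.6], [0.8, 0.2]])
    transition = np.log([[0.9, 0.1], [0.2, 0.8]])
    print(viterbi(emission, transition))   # [0, 0, 0]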

9. Dense word representations (word2vec and similar models): understanding how word embeddings work

9.1. Learning how to benefit from pretrained embeddings

9.2. Generating combined feature representations from counts and word embeddings for better document classification (sketch below)
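
A sketch of 9.2: concatenating a count vector with the mean of pretrained word embeddings into a single feature vector. The tiny two-dimensional "embeddings" dict is a stand-in for a real pretrained model:

    import numpy as np

    embeddings = {"good": np.array([0.9, 0.1]),   # made-up 2-d vectors
                  "movie": np.array([0.2, 0.8])}
    vocab = ["good", "bad", "movie"]

    def featurize(tokens):
        counts = np.array([tokens.count(w) for w in vocab], dtype=float)
        vecs = [embeddings[t] for t in tokens if t in embeddings]
        dense = np.mean(vecs, axis=0) if vecs else np.zeros(2)
        return np.concatenate([counts, dense])    # count + embedding features

    print(featurize(["good", "movie"]))   # -> [1. 0. 1. 0.55 0.45]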

10. Recurrent neural networks for NLP tasks

10.1. How to implement models that use recurrent cells such as the LSTM and the GRU (sketch below)
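
A minimal sketch of 10.1 as a PyTorch LSTM classifier; PyTorch is an assumption here (the course may use another framework) and all sizes are arbitrary:

    import torch
    import torch.nn as nn

    class LSTMClassifier(nn.Module):
        def __init__(self, vocab_size=100, embed_dim=16, hidden_dim=32, n_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, n_classes)

        def forward(self, token_ids):              # (batch, seq_len) integer ids
            x = self.embed(token_ids)              # (batch, seq_len, embed_dim)
            _, (h, _) = self.lstm(x)               # h: (1, batch, hidden_dim)
            return self.out(h[-1])                 # logits: (batch, n_classes)

    model = LSTMClassifier()
    logits = model(torch.randint(0, 100, (4, 7)))  # 4 sequences of length 7
    print(logits.shape)                            # torch.Size([4, 2])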

 

 

Teaching methods and general organization

 

Classes will have two parts:

- Key conceptual ideas and basic theory will be explained on slides and on the blackboard.
- Practical knowledge will be demonstrated by running Jupyter notebooks.

 

 

 

Official assessment of learning outcomes

 

The evaluation is based on two deliverable projects (70%) and a final exam (30%).

  • Projects count for 70% of the final mark.
    • There are two projects to be delivered (with code to replicate the results).
    • Each project is organized as a "challenge": students are given a task (a dataset and a metric) and have some freedom to explore different strategies to solve it.
  • The final exam counts for 30% of the final mark.
    • The final exam covers all the material in the course.

 

 

Reading and study resources

Check availability in CERCABIB.

Book

Foundations of Statistical Natural Language Processing (Christopher D. Manning and Hinrich Schütze)

Neural Network Methods in Natural Language Processing (Yoav Goldberg, Graeme Hirst)

Deep Learning in Natural Language Processing (Li Deng, Yang Liu)

Regex Quick Syntax Reference (Zsolt Nagy)