Teaching plan for the course unit







General information


Course unit name: Natural Language Processing

Course unit code: 572673

Academic year: 2021-2022

Coordinator: Daniel Ortiz Martinez

Department: Faculty of Mathematics and Computer Science

Credits: 3

Single program: S



Estimated learning time

Total number of hours 75


Face-to-face and/or online activities



-  Lecture

Face-to-face and online




-  Lecture with practical component




Supervised project


Independent learning






Students should be comfortable using NumPy before joining the course.



Competences to be gained during study


To know how to gather and extract information from structured and non-structured data sources.


To know how to clean and transform data with the goal of creating valuable, manageable, and informative data sets.


To be able to use storage and processing technologies for handling large data sets.


To apply analytic and predictive machine learning techniques efficiently and effectively.


To communicate results using appropriate communication skills and visualization tools and techniques.





Learning objectives


Referring to knowledge

Getting to know the basic concepts in the area of natural language processing.


Discovering and using the basic natural language processing methods.



Teaching blocks


1. Overview of the course

2. Representing text for machines, working with strings, formatting strings, useful functions for strings, parsing text documents and creating a corpus

2.1. How to format strings

2.2. Relevant functions for strings

2.3. Parse text documents

3. Regular expressions

3.1. How to define regular expressions for extracting relevant information

3.2. Regular expressions for tokenizing sentences

4. Language models and edit distance

4.1. How can we define metrics over strings?

4.2. Detecting words similar to a misspelled word

4.3. Fast retrieval of similar words: the BK-tree data structure; using Cython to speed up distance computations

4.4. NLTK

5. Document representations for classification

5.1. scipy.sparse matrices and efficient ways to operate on sparse matrices

5.2. Using machine learning models such as logistic regression, perceptron, support vector machines and multilayer perceptron

5.3. Using Pipelines to group and train machine learning and preprocessing steps

6. Document representations for retrieval tasks

6.1. How can we represent documents in a way that is suitable to define a metric between documents?

6.2. Tf-idf feature vector

6.3. Techniques and data structures for fast retrieval of similar words

7. Clustering documents

7.1. Understand how to retrieve documents similar to a query document using a model

7.2. Group related documents using different approaches: k-means and mixtures of Gaussians

8. Sequence models for NLP tasks. Named Entity Recognition and Part of Speech tagging

8.1. Understanding how to find the most likely hidden state sequence according to some scores, and using it to anonymize documents

8.2. Implementing a Hidden Markov model

8.3. Implementing a Structured Perceptron model

9. Dense word representations (word2vec and similar models): understanding how word embeddings work

9.1. Learn how to benefit from pretrained embeddings

9.2. Generate feature representations that combine counts and word embeddings for better document classification

10. Recurrent neural networks for NLP tasks

10.1. How to implement models that use recurrent loops such as LSTM and GRU Cells
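
As an illustration of items 4.1 and 4.2, the sketch below defines a metric over strings and uses it to find vocabulary words close to a misspelled word. It is a minimal sketch that assumes Levenshtein distance as the metric; the function names, the toy vocabulary and the `max_distance` threshold are hypothetical and are not taken from the course material.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the already processed
    # prefix of a and the prefix b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def closest_words(word, vocabulary, max_distance=2):
    """Vocabulary words within max_distance of word, nearest first."""
    scored = sorted((levenshtein(word, v), v) for v in vocabulary)
    return [v for distance, v in scored if distance <= max_distance]


# Hypothetical toy vocabulary, for illustration only.
VOCABULARY = ["language", "luggage", "lineage", "message"]

if __name__ == "__main__":
    print(closest_words("langauge", VOCABULARY))
```

This linear scan compares the query against every vocabulary word. The BK tree mentioned in item 4.3 avoids many of those comparisons by exploiting the triangle inequality of the metric.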



Teaching methods and general organization


Each class has two parts:


- Key conceptual ideas and basic theory will be explained using slides and the blackboard.

- Practical knowledge will be shown by running Jupyter notebooks.




Official assessment of learning outcomes


The evaluation is based on two deliverable projects (70%) and a final exam (30%).

  • Projects count for 70% of the final mark.
    • There are two projects to be delivered (with code to replicate the results).
    • Each project consists of a "challenge": students are given a task (a dataset and a metric) and have some freedom to explore different strategies to solve it.
  • The final exam counts for 30% of the final mark.
    • The final exam covers all the material in the course.



Reading and study resources

Check availability in CERCABIB


Foundations of Statistical Natural Language Processing (Christopher D. Manning and Hinrich Schütze)

Neural Network Methods in Natural Language Processing (Yoav Goldberg, Graeme Hirst)

Deep Learning in Natural Language Processing (Li Deng, Yang Liu)

Regex Quick Syntax Reference (Zsolt Nagy)