Computing with Language: Texts and Words
1.1 Getting Started with Python
1.2 Getting Started with NLTK
How to download corpora?
import nltk
nltk.download()
How to load corpora?
from nltk.book import *
1.3 Searching Text
What is a concordance view?
A concordance view displays every occurrence of a given word in its context (= preceding and following words).
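In NLTK the call is text.concordance('word') on a Text object. The idea itself can be sketched in plain Python, with no corpus needed. The function name concordance_lines, its window parameter, and the sample sentence are illustrative assumptions, not NLTK's implementation.

```python
def concordance_lines(tokens, word, window=2):
    """Return each occurrence of `word` with up to `window` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(' '.join(left + [token] + right))
    return lines

tokens = "the whale swam and the monstrous whale dived".split()
for line in concordance_lines(tokens, 'whale'):
    print(line)
# the whale swam and
# the monstrous whale dived
```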
What other words appear in a similar range of contexts?
How to examine shared contexts between words?
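NLTK answers these two questions with text.similar('word') and text.common_contexts(['word1', 'word2']). A rough plain-Python sketch of the underlying idea follows, treating a context as the pair (previous word, next word). The helper names and the sample tokens are made up for illustration, and NLTK's own methods do more than this.

```python
def contexts(tokens, word):
    """Set of (previous, next) word pairs around each occurrence of `word`."""
    return {
        (tokens[i - 1], tokens[i + 1])
        for i in range(1, len(tokens) - 1)
        if tokens[i] == word
    }

def shared_contexts(tokens, word1, word2):
    """Contexts in which both words occur."""
    return contexts(tokens, word1) & contexts(tokens, word2)

tokens = "a very good day and a very bad day but a good day".split()
print(shared_contexts(tokens, 'good', 'bad'))  # {('very', 'day')}
```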
What's a dispersion plot? How to obtain it?
A dispersion plot displays the locations of a word in the text: each stripe represents one occurrence of the word.
text.dispersion_plot(['citizens', 'democracy', 'freedom', 'duties', 'America'])
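The plot itself needs matplotlib, but the data behind it is simply the list of positions at which each word occurs. A minimal sketch of that step, where word_offsets and the sample tokens are hypothetical names used only here:

```python
def word_offsets(tokens, word):
    """Positions (0-based token indexes) at which `word` occurs."""
    return [i for i, token in enumerate(tokens) if token == word]

tokens = "freedom and duties , freedom and democracy".split()
print(word_offsets(tokens, 'freedom'))    # [0, 4]
print(word_offsets(tokens, 'democracy'))  # [6]
```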
1.4 Counting Vocabulary
What's a token?
A sequence of characters treated as a unit, such as a word or a punctuation symbol.
How to obtain the number of tokens?
What's the vocabulary of a text?
A vocabulary is the set of tokens contained in a text.
How to sort an array in Python?
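A short sketch covering the three questions above (token count, vocabulary, sorting), using an invented token list in place of an NLTK text. The same built-ins work on NLTK Text objects.

```python
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']

print(len(tokens))         # number of tokens: 7

vocabulary = set(tokens)   # duplicates collapse, so 'the' is kept once
print(len(vocabulary))     # number of distinct tokens: 6

# sorted() returns a new sorted list; punctuation sorts before letters
print(sorted(vocabulary))  # ['.', 'cat', 'mat', 'on', 'sat', 'the']
```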
What's a word type?
A word considered as a unique item within a given vocabulary.
How do you quantify the lexical richness of a text?
Divide the number of distinct words by the total number of words. The result is the proportion of distinct words, between 0 and 1 (multiply by 100 for a percentage).
len(set(text)) / len(text)
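Wrapped in a function, the same formula might look like the sketch below. The name lexical_diversity and the sample tokens are only illustrative.

```python
def lexical_diversity(tokens):
    """Proportion of distinct tokens: 1.0 means no token is repeated."""
    return len(set(tokens)) / len(tokens)

tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', '.']
print(lexical_diversity(tokens))  # 6 distinct / 8 total = 0.75
```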
How to count occurrences of a specific word within a text?
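Lists have a count() method, and NLTK Text objects offer text.count('word') in the same way. A minimal sketch on an invented token list:

```python
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', '.']

occurrences = tokens.count('the')
print(occurrences)                      # 3

# share of the text taken up by that word, as a percentage
print(100 * occurrences / len(tokens))  # 37.5
```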
How to create a function in Python?
def function_x(param):
    return True
A Closer Look at Python: Texts as Lists of Words
How to create a list in Python?
['', '', ...]
How to concatenate lists in Python?
[ ] + [ ]
How to append an element to a list in Python?
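A small sketch of the three list operations above, with invented sentences. Note that + builds a new list while append() changes the list in place.

```python
sent1 = ['Call', 'me', 'Ishmael', '.']
sent2 = ['Hello', 'world']

combined = sent1 + sent2  # concatenation: a new list, originals untouched
print(combined)           # ['Call', 'me', 'Ishmael', '.', 'Hello', 'world']

sent2.append('!')         # append: modifies sent2 itself and returns None
print(sent2)              # ['Hello', 'world', '!']
```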
2.2 Indexing lists
What's an index?
It's the position of an item inside an array/list.
How to access an index from a value?
How to access a value given an index?
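Both directions in one sketch, on an invented list. Indexes start at 0, and index() returns the position of the first match only.

```python
tokens = ['Call', 'me', 'Ishmael', '.']

print(tokens.index('Ishmael'))  # value -> index: 2
print(tokens[2])                # index -> value: Ishmael
print(tokens[-1])               # negative indexes count from the end: .
```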
What is slicing?
It's retrieving a subpart of an array/list.
text[index1:index2] ; text[index1:] ; text[:index2]
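The three slice forms on a toy list. The start index is included and the end index is excluded.

```python
tokens = ['a', 'b', 'c', 'd', 'e']

print(tokens[1:3])  # from index 1 up to (not including) 3: ['b', 'c']
print(tokens[2:])   # from index 2 to the end: ['c', 'd', 'e']
print(tokens[:2])   # from the start up to (not including) 2: ['a', 'b']
```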
What's a string?
Strings are sequences of characters, so a string supports many of the same operations as a list (indexing, slicing, concatenation). Unlike lists, strings are immutable.
How to convert a list to a string?
How to convert a string to a list?
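A minimal sketch of both conversions using str.join() and str.split(); the sample words are invented.

```python
words = ['Monty', 'Python']

sentence = ' '.join(words)  # list -> string, with a space between items
print(sentence)             # Monty Python

print(sentence.split())     # string -> list of words: ['Monty', 'Python']
print(list('abc'))          # string -> list of characters: ['a', 'b', 'c']
```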
Computing with Language: Simple Statistics
3.1 Frequency Distributions
What is a frequency distribution?
It's a mapping from each vocabulary item to the number of times it occurs in a given text.
fdist = FreqDist(text)
How to get the most frequent tokens?
fdist.most_common(<number of tokens to get>)
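NLTK's FreqDist is built on Python's collections.Counter, so the idea can be tried without downloading a corpus. The sketch below uses Counter as a stand-in for FreqDist, on an invented token list.

```python
from collections import Counter

tokens = "the cat and the dog and the bird".split()
fdist = Counter(tokens)      # maps each token to its count

print(fdist['the'])          # 3
print(fdist.most_common(2))  # [('the', 3), ('and', 2)]
```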
How to obtain a cumulative frequency plot?
A cumulative frequency plot tells us what proportion of a text is taken up by the most common tokens.
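With NLTK the plot is obtained with fdist.plot(50, cumulative=True), which requires matplotlib. The numbers behind such a plot can be sketched without plotting; here collections.Counter stands in for FreqDist and the tokens are invented.

```python
from collections import Counter
from itertools import accumulate

tokens = "the cat and the dog and the bird".split()
ranked = Counter(tokens).most_common()  # most frequent tokens first

# running total of counts, expressed as a share of the whole text
counts = [count for _, count in ranked]
cumulative = [total / len(tokens) for total in accumulate(counts)]

for (word, _), share in zip(ranked, cumulative):
    print(word, share)
# the two most common tokens ('the', 'and') already cover 0.625 of the text
```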
What's a hapax?
A hapax (hapax legomenon) is a word that occurs only once in a text. Hapaxes are often treated as outliers in data analysis, and thus not generally useful.
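NLTK provides fdist.hapaxes() for this. The same selection can be sketched with collections.Counter standing in for FreqDist, on an invented token list:

```python
from collections import Counter

tokens = "the cat and the dog and the bird".split()
fdist = Counter(tokens)

# keep only the words whose count is exactly 1
hapaxes = [word for word, count in fdist.items() if count == 1]
print(hapaxes)  # ['cat', 'dog', 'bird']
```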
3.2 Fine-grained Selection of Words
How to operate a fine-grained word selection by word length and frequency?
Obtain the words that are more than 7 characters long and that appear more than 7 times in the text:
sorted(w for w in set(text) if len(w) > 7 and fdist[w] > 7)
The result helps identify the key content words of a text: long enough to be informative, frequent enough to be characteristic.
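A runnable sketch of this selection on a toy token list, with the two thresholds turned into parameters. The function name select_words, its parameter names, and the data are assumptions for illustration, and Counter stands in for FreqDist.

```python
from collections import Counter

def select_words(tokens, longer_than, more_often_than):
    """Words strictly longer than `longer_than` characters that occur
    strictly more than `more_often_than` times."""
    fdist = Counter(tokens)
    return sorted(
        w for w in set(tokens)
        if len(w) > longer_than and fdist[w] > more_often_than
    )

tokens = ['government'] * 3 + ['the'] * 5 + ['constitution'] + ['citizens'] * 2
print(select_words(tokens, 7, 1))  # ['citizens', 'government']
```

'the' is frequent but too short, and 'constitution' is long but occurs only once, so both are filtered out.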
3.3 Collocations and Bigrams
What's a bigram?
A pair of adjacent words.
list(bigrams(['more', 'is', 'said', 'than', 'done']))
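For intuition, the same pairs can be produced in plain Python by zipping the list with itself shifted by one position. This is only a sketch of the idea, not how NLTK implements bigrams().

```python
words = ['more', 'is', 'said', 'than', 'done']

# pair each word with the word that follows it
pairs = list(zip(words, words[1:]))
print(pairs)
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```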
What is a collocation?
A sequence of words that occurs together unusually often and that resists substitution with words of similar meaning. In practice, a collocation is essentially a frequent bigram.
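In NLTK, text.collocations() prints the collocations of a text. As a simplified sketch of the "frequent bigram" part only, adjacent pairs can be counted with collections.Counter. The tokens are invented, and this naive count does not apply the extra filtering NLTK uses.

```python
from collections import Counter

tokens = "red wine and red wine or white wine".split()

# count every adjacent pair of words
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts.most_common(1))  # [(('red', 'wine'), 2)]
```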
3.4 Counting Other Things
How to get the frequency distribution of the different word lengths?
fdist = FreqDist(len(w) for w in text)
How to obtain the max value in a list?
How to access a given frequency in a frequency distribution?
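With a FreqDist, fdist.max() gives the most frequent sample, fdist[sample] its count, and fdist.freq(sample) its relative frequency. The sketch below shows the equivalent steps with collections.Counter as a stand-in, on invented tokens.

```python
from collections import Counter

tokens = "the cat and the dog and the bird".split()
fdist = Counter(len(w) for w in tokens)  # word length -> how many words

print(max([3, 1, 4]))                    # max value in a list: 4
print(fdist[3])                          # number of 3-letter words: 7
print(fdist.most_common(1)[0][0])        # most frequent word length: 3
print(fdist[3] / len(tokens))            # relative frequency: 0.875
```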
Back to Python: making decisions and taking control
How to create an if statement in Python?
if len(token) < 5:
    print('word length is less than 5')
elif token.istitle():
    print(token, 'is a titlecase word')
else:
    print(token, 'is punctuation')
How to create a loop in Python?
for word in ['Call', 'me', 'Ishmael', '.']:
    print(word)
How to operate on every element of a list?
[function(w) for w in text]
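A concrete instance of this pattern (a list comprehension), using built-in functions on an invented list:

```python
tokens = ['Call', 'me', 'Ishmael', '.']

lengths = [len(w) for w in tokens]    # apply len() to every token
print(lengths)                        # [4, 2, 7, 1]

lowered = [w.lower() for w in tokens]  # call a method on every token
print(lowered)                        # ['call', 'me', 'ishmael', '.']
```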
Automatic Natural Language Understanding
What is Word Sense Disambiguation?
It's an area of NLP where we want to discover the intended meaning of a word in a given context.
What is Pronoun Resolution?
It's about working out who did what to whom: detecting the subjects and objects of verbs, and finding the antecedent of a pronoun.
What is Anaphora Resolution?
It's a part of pronoun resolution where we identify what a pronoun or noun phrase refers to.
What is Semantic Role Labeling?
It's about identifying how a noun phrase relates to the verb (as agent, patient, instrument, and so on). Like anaphora resolution, it helps work out who did what to whom.
What is Text Alignment?
It's the process of automatically pairing up the sentences of a document available in two languages. Once we have a million or more sentence pairs, we can detect corresponding words and phrases, and build a model that can be used, for example, to translate new text.
What is a Spoken Dialogue System?
It's a pipeline of language understanding components that produces a spoken answer to a spoken question.
What is RTE (Recognizing Textual Entailment)?
It's a challenge in language understanding where you try to decide automatically whether a hypothesis is supported by a given text.