Notes on "Natural Language Toolkit" - Chapter 2: Accessing Text Corpora and Lexical Resources

  1. Accessing Text Corpora

1.1 Gutenberg Corpus

What is a corpus?

A large, structured collection of texts (plural: corpora).

How to get the list of texts in a corpus?
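
A minimal sketch of one way to answer this: NLTK corpus readers expose a fileids() method that returns one identifier per text. The helper name list_texts is made up for these notes, and the snippet assumes the Gutenberg data has been installed with nltk.download('gutenberg').

```python
from nltk.corpus import gutenberg


def list_texts(corpus):
    """Return the file identifiers of every text in an NLTK corpus reader."""
    return list(corpus.fileids())


try:
    # Each identifier names one text in the corpus.
    for fileid in list_texts(gutenberg):
        print(fileid)
except LookupError:
    # NLTK raises LookupError when the corpus data is not installed locally.
    print("Corpus data missing: run nltk.download('gutenberg') first.")
```

The same call should work on the other corpora in these notes, since fileids() belongs to the common corpus reader interface.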


How to access a default corpus in NLTK?

from nltk.corpus import gutenberg

How to obtain the content of a file without any linguistic processing (not split up into tokens)?
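
A possible answer, sketched under the assumption that the Gutenberg data is installed: the raw() method of a corpus reader returns the file contents as a single string. The helper raw_preview and its n_chars parameter are illustrative names, not part of NLTK.

```python
from nltk.corpus import gutenberg


def raw_preview(corpus, fileid, n_chars=75):
    """Return the first n_chars characters of a text as one plain string."""
    # raw() gives the file contents unprocessed: no word or sentence splitting.
    return corpus.raw(fileid)[:n_chars]


try:
    print(raw_preview(gutenberg, 'austen-emma.txt'))
    # len() on the raw string counts characters, including spaces.
    print(len(gutenberg.raw('austen-emma.txt')))
except LookupError:
    print("Corpus data missing: run nltk.download('gutenberg') first.")
```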


How to divide the text up into its sentences?
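
One way to do this, as a sketch: the sents() method returns the text as a list of sentences, each sentence being a list of word tokens. The helper first_sentences is a hypothetical name, and the snippet assumes both the Gutenberg data and the Punkt sentence tokenizer data are installed.

```python
from nltk.corpus import gutenberg


def first_sentences(corpus, fileid, n=2):
    """Return the first n sentences of a text, each as a list of word tokens."""
    return [list(sent) for sent in corpus.sents(fileid)[:n]]


try:
    for sent in first_sentences(gutenberg, 'shakespeare-macbeth.txt'):
        print(sent)
except (LookupError, ValueError):
    # Depending on the NLTK version, missing corpus or tokenizer data
    # surfaces as LookupError or ValueError.
    print("Data missing: install 'gutenberg' and the Punkt tokenizer data.")
```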


1.2 Web and Chat Text

How to access default web texts in NLTK?

from nltk.corpus import webtext

How to access default chat conversations in NLTK?

from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')

1.3 Brown Corpus

What is stylistics?

Stylistics is the study of systematic differences between genres. Word counts can help distinguish genres: in the Brown Corpus, the most frequent modal verb in the 'news' genre is 'will', while in the 'romance' genre it is 'could'.
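
The modal comparison can be reproduced roughly as follows. This is a sketch, not the book's exact code: the helper modal_counts and the MODALS list are names chosen for these notes, lowercasing the tokens is an assumption of mine, and the Brown data must be installed.

```python
from nltk import ConditionalFreqDist
from nltk.corpus import brown

MODALS = ['can', 'could', 'may', 'might', 'must', 'will']


def modal_counts(corpus, genres, modals=MODALS):
    """Count each modal verb per genre; returns {genre: {modal: count}}."""
    # One (genre, word) pair per token; ConditionalFreqDist tallies by genre.
    cfd = ConditionalFreqDist(
        (genre, word.lower())
        for genre in genres
        for word in corpus.words(categories=genre)
    )
    return {g: {m: cfd[g][m] for m in modals} for g in genres}


try:
    for genre, counts in modal_counts(brown, ['news', 'romance']).items():
        top = max(counts, key=counts.get)
        print(genre, counts, 'most frequent:', top)
except LookupError:
    print("Corpus data missing: run nltk.download('brown') first.")
```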

What is the Brown Corpus?

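The first million-word electronic corpus of English, created in 1961 at Brown University. Its texts are categorized by genre (news, editorial, romance, and so on), which makes it a convenient resource for studying systematic differences between genres: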

from nltk.corpus import brown

1.4 Reuters Corpus

What is the Reuters Corpus?

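A collection of 10,788 news documents classified into 90 topics and split into training and test sets. It is intended for training and testing algorithms that automatically detect the topic of a document. Unlike the Brown Corpus, its categories overlap, because a single news story often covers several topics.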

from nltk.corpus import reuters
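
To see the overlap between categories, one can look for documents that carry more than one topic label. A small sketch, assuming the Reuters data is installed; multi_topic_docs and its limit parameter are illustrative names.

```python
from nltk.corpus import reuters


def multi_topic_docs(corpus, limit=5):
    """Return up to `limit` (fileid, topics) pairs for documents with 2+ topics."""
    found = []
    for fileid in corpus.fileids():
        topics = corpus.categories(fileid)
        if len(topics) > 1:
            found.append((fileid, list(topics)))
            if len(found) == limit:
                break
    return found


try:
    for fileid, topics in multi_topic_docs(reuters):
        print(fileid, topics)
except LookupError:
    print("Corpus data missing: run nltk.download('reuters') first.")
```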

1.5 Inaugural Address Corpus

What is the Inaugural Address Corpus?

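A collection of US presidential inaugural addresses, one file per address. Each file name begins with the year of the address, so the corpus has a time dimension and can be used to study how language use changes over time.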

from nltk.corpus import inaugural
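
The time dimension can be used by reading the year from the start of each file identifier. The sketch below counts one word per address; word_count_by_year is a hypothetical helper, exact lowercase matching is my simplification, and the inaugural data must be installed.

```python
from nltk.corpus import inaugural


def word_count_by_year(corpus, target):
    """Map the year of each address to how often `target` occurs in it."""
    counts = {}
    for fileid in corpus.fileids():
        year = fileid[:4]  # file names look like '1789-Washington.txt'
        words = (w.lower() for w in corpus.words(fileid))
        counts[year] = sum(1 for w in words if w == target)
    return counts


try:
    for year, n in word_count_by_year(inaugural, 'america').items():
        print(year, n)
except LookupError:
    print("Corpus data missing: run nltk.download('inaugural') first.")
```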

1.6 Annotated Text Corpora

How to get a list of all NLTK corpora?
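
The notes leave this unanswered. One possible approach, which is my assumption rather than something taken from the chapter, is to list the corpora installed locally by scanning the directories in nltk.data.path. The helper installed_corpora is a made-up name.

```python
import os

import nltk


def installed_corpora(search_paths=None):
    """Return sorted names found in the 'corpora' folder of each NLTK data dir."""
    if search_paths is None:
        search_paths = nltk.data.path  # directories NLTK searches for data
    names = set()
    for base in search_paths:
        corpora_dir = os.path.join(base, 'corpora')
        if os.path.isdir(corpora_dir):
            for entry in os.listdir(corpora_dir):
                # Corpora are stored either unpacked or as .zip archives.
                names.add(entry[:-4] if entry.endswith('.zip') else entry)
    return sorted(names)


print(installed_corpora())
```

This only shows what is already on disk. The full catalogue of downloadable corpora can be browsed with the interactive nltk.download().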


1.7 Corpora in Other Languages

The udhr corpus contains the Universal Declaration of Human Rights in over 300 languages:

from nltk.corpus import udhr
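
The file identifiers of this corpus combine a language name and a character encoding, for example 'English-Latin1'. A sketch of how to pick languages by encoding, assuming the udhr data is installed; languages_with_encoding is an illustrative helper name.

```python
from nltk.corpus import udhr


def languages_with_encoding(corpus, encoding='Latin1'):
    """Return language names whose file identifier ends in '-<encoding>'."""
    suffix = '-' + encoding
    return [f[:-len(suffix)] for f in corpus.fileids() if f.endswith(suffix)]


try:
    print(languages_with_encoding(udhr)[:10])
    # words() takes the full identifier, language plus encoding.
    print(udhr.words('English-Latin1')[:12])
except LookupError:
    print("Corpus data missing: run nltk.download('udhr') first.")
```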

1.8 Text Corpus Structure

How to access the categories of a corpus?
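
A sketch of one answer, using the Brown Corpus as an example of a categorized corpus: categories() lists the labels, and fileids(categories=...) lists the files under a label. The helper files_per_category is a name invented for these notes, and the Brown data must be installed.

```python
from nltk.corpus import brown


def files_per_category(corpus):
    """Map each category of a categorized corpus to its number of files."""
    return {
        cat: len(corpus.fileids(categories=cat))
        for cat in corpus.categories()
    }


try:
    print(brown.categories())        # all genre labels
    print(brown.categories('ca01'))  # labels attached to one file
    print(files_per_category(brown))
except LookupError:
    print("Corpus data missing: run nltk.download('brown') first.")
```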


How to list the words contained in the corpus?
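
A possible answer: words() returns the tokens of a text (or of the whole corpus when called without arguments) as a list-like sequence. The sketch below also derives a vocabulary from it; vocabulary is a hypothetical helper, and the Gutenberg data is assumed to be installed.

```python
from nltk.corpus import gutenberg


def vocabulary(corpus, fileid):
    """Return the sorted set of lowercased alphabetic words in one text."""
    return sorted({w.lower() for w in corpus.words(fileid) if w.isalpha()})


try:
    emma = gutenberg.words('austen-emma.txt')  # list-like sequence of tokens
    print(len(emma), emma[:8])
    print(len(vocabulary(gutenberg, 'austen-emma.txt')))
except LookupError:
    print("Corpus data missing: run nltk.download('gutenberg') first.")
```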


1.9 Loading your own Corpus

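How to load your own corpus?

from nltk.corpus import PlaintextCorpusReader
# Directory that holds the plain-text files (here, the system word lists)
corpus_root = '/usr/share/dict'
# '.*' is a regular expression that matches every file under corpus_root
wordlists = PlaintextCorpusReader(corpus_root, '.*')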
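
Once loaded, a personal corpus answers the same calls as the built-in ones. The sketch below builds a throwaway corpus in a temporary directory so that it runs without any downloaded data; the file names and sample sentences are invented for illustration.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

with tempfile.TemporaryDirectory() as corpus_root:
    # Create two small text files to stand in for a personal corpus.
    samples = {
        'first.txt': 'Hello corpus world.',
        'second.txt': 'More text here.',
    }
    for name, text in samples.items():
        with open(os.path.join(corpus_root, name), 'w', encoding='utf8') as f:
            f.write(text)

    # Only files whose names match the regular expression are included.
    my_corpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')
    file_names = list(my_corpus.fileids())
    first_words = list(my_corpus.words('first.txt'))

print(file_names)
print(first_words)
```

The results are copied into plain lists inside the with block because the temporary directory is deleted when the block ends.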


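How to load a local copy of the Penn Treebank?

from nltk.corpus import BracketParseCorpusReader
# Root of the parsed WSJ section of a local Penn Treebank copy
corpus_root = r'C:\corpora\penntreebank\parsed\mrg\wsj'
# Match every wsj_*.mrg file in any subdirectory of corpus_root
file_pattern = r'.*/wsj_.*\.mrg'
ptb = BracketParseCorpusReader(corpus_root, file_pattern)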