Resources and Links

Linguistic corpora

Jointly with the NU Library and related departments, the Linguistics Department holds a membership in the Linguistic Data Consortium (LDC), which provides access to a wide variety of text and speech corpora in many languages. In addition, a number of proprietary and specially licensed corpora are in use for research purposes. For instructions on accessing and using currently installed corpora, or for questions about obtaining corpora, follow this link.

The following corpora are currently available on Box (formerly Babel). Please contact Chun Chan for access.

Directory

Corpus Name

Description

ANC

American National Corpus

Large, multi-genre American English text collection

Aix-Marsec

Aix-Marsec database

Spoken British English, annotated from phone to intonation

brown-untagged

Brown Corpus

Francis & Kucera (1979) corpus, English written texts

BU_Radio

Boston University Radio News

Read speech, annotated from phone to intonation

buckeye

Buckeye Corpus

Spontaneous American English speech, orthographically transcribed, labeled from phone to intonation

Celex

CELEX (Release 2)

Lexical information for English, German, Dutch

celex-old

CELEX (Release 1)

Lexical information for English, German, Dutch (archival)

challenger-raw

Challenger-Raw

Transcript of the Space Shuttle Challenger commission

challenger-tagged

Challenger-Tagged

Syntactically annotated transcript of Challenger

childes

CHILDES

CHILDES database (N.B.: not the most current version)

elberfelder-bible

Elberfelder Bible

Text of the Revidierte Elberfelder Bible (German)

gutenberg

Project Gutenberg

Classic texts (1993)

helsinki

Helsinki Corpus

Old, Middle, and Early Modern English texts

hist-docs

Historical documents

Text of historical documents (e.g., Magna Carta)

hkust_mandarin_1

HKUST Mandarin Telephone Transcripts

Orthographically transcribed Mandarin conversational speech

hoosier

Hoosier Mental Lexicon

Lexical database for American English

IndianEng

Indian English

Indian English speech database

kolhapur

Kolhapur Corpus

Indian English written texts

oxford-text-archive

Oxford Text Archive

nxt_switchboard_ann

NXT Switchboard Annotations

Linked annotations for Switchboard corpus: syntactic structure, disfluencies, phonetic transcripts, noun phrase animacy, word timing information, focus/contrast and prosodic structure, phone/syllable alignment, information structure.

ppcme

Penn-Helsinki Parsed Corpus of Middle English

Syntactically parsed Middle English texts

rnc

Russian National Corpus

Russian texts

rst_discourse_treebank

Rhetorical Structure Theory Discourse Treebank

WSJ articles, annotated with discourse structure

spanish

Madrid Corpus

Annotated Spanish text

susanne

SUSANNE

Geoffrey Sampson's annotated corpus of written English

Treebank3

Penn Treebank, release 3

Syntactically annotated written and spoken English texts

Zalizniak

Zalizniak Dictionary

Russian-English dictionary

 

Software

A variety of software for the analysis of spoken and written texts is available in the department's research labs. This includes Praat, a program for speech analysis and synthesis; a variety of part-of-speech taggers, morphological analyzers, parsers, and programs for the statistical analysis of written texts; and site-licensed packages for large-scale modeling, visualization, and numerical analysis, such as Mathematica, Matlab, SAS, and SPSS. The use of these computational tools in student work is strongly encouraged at Northwestern.

In addition to these externally developed packages, several software applications for audio processing and experimental design have been developed at Northwestern. For more information, please see Chun Chan's website.

Useful links

Institutional support

Some online databases