Resources and Links: Department of Linguistics

Linguistic corpora

Jointly with the NU Library and related departments, the Linguistics Department holds a membership in the Linguistic Data Consortium (LDC), which provides access to a wide variety of text and speech corpora in many languages. In addition, a number of proprietary and specially licensed corpora are in use for research purposes. For instructions on accessing and using currently installed corpora, or for questions about obtaining corpora, follow this link.

Linguistics Corpora Table
Directory	Corpus Name	Description
ANC	American National Corpus	"Large, multi-genre American English text collection"
Aix-Marsec	Aix-Marsec database	"Spoken British English, annotated from phone to intonation"
brown-untagged	Brown Corpus	"Francis & Kucera (1979) corpus, English written texts"
BU_Radio	Boston University Radio News	"Read speech, annotated from phone to intonation"
buckeye	Buckeye Corpus	"Spontaneous American English speech, orthographically transcribed, labeled from phone to intonation"
Celex	CELEX (Release 2)	"Lexical information for English, German, Dutch"
celex-old	CELEX (Release 1)	"Lexical information for English, German, Dutch (archival)"
challenger-raw	Challenger-Raw	Transcript of space shuttle Challenger commission
challenger-tagged	Challenger-Tagged	Syntactically annotated transcript of Challenger
childes	CHILDES	CHILDES database (n.b. not most current version)
elberfelder-bible	Elberfelder Bible	Text of the Revidierte Elberfelder Bible (German)
gutenberg	Project Gutenberg	Classic texts (1993)
helsinki	Helsinki Corpus	Old, Middle, and Early Modern English texts
hist-docs	Historical documents	"Text of historical documents (e.g., Magna Carta)"
hkust_mandarin_1	HKUST Mandarin Telephone Transcripts	Orthographically transcribed Mandarin conversational speech
hoosier	Hoosier Mental Lexicon	Lexical database for American English
IndianEng	Indian English	Indian English speech database
kolhapur	Kolhapur Corpus	Indian English written texts
oxford-text-archive	Oxford Text Archive
nxt_switchboard_ann	NXT Switchboard Annotations	Linked annotations for Switchboard corpus: syntactic structure, disfluencies, phonetic transcripts, noun phrase animacy, word timing information, focus/contrast and prosodic structure, phone/syllable alignment, information structure.
ppcme	Penn-Helsinki Parsed Corpus of Middle English	Syntactically parsed Middle English texts
rnc	Russian National Corpus	Russian texts
rst_discourse_treebank	Rhetorical Structure Theory Discourse Treebank	"WSJ articles, annotated with discourse structure"
spanish	Madrid Corpus	Annotated Spanish text
susanne	SUSANNE	Geoffrey Sampson's annotated corpus of written English
Treebank3	"Penn Treebank, release 3"	Syntactically annotated written and spoken English texts
Zalizniak	Zalizniak Dictionary	Russian-English dictionary

Software

A variety of software for the analysis of spoken and written texts is available in the department's research labs. This includes Praat, a program for speech analysis and synthesis; a variety of part-of-speech taggers, morphological analyzers, parsers, and programs for the statistical analysis of written texts; and site-licensed packages for large-scale modeling, visualization, and numerical analysis, such as Mathematica, MATLAB, SAS, and SPSS. The use of these computational tools in student work is strongly encouraged at Northwestern.

In addition to these externally developed packages, several software applications for audio processing and experimental design have been developed at Northwestern. For more information, please see Chun Chan's website.
