
Resources and Links

Linguistic corpora

Jointly with the NU Library and related departments, the Linguistics Department holds a membership in the Linguistic Data Consortium (LDC), which provides access to a wide variety of text and speech corpora in many languages. In addition, a number of proprietary and specially licensed corpora are in use for research purposes. For instructions on accessing and using currently installed corpora, or for questions about obtaining corpora, follow this link.

The following corpora are currently available on fisher, the Linguistics corpus server:

Directory | Corpus Name | Description
ANC | American National Corpus | Large, multi-genre American English text collection
Aix-Marsec | Aix-Marsec database | Spoken British English, annotated from phone to intonation
brown-untagged | Brown Corpus | Francis & Kucera (1979) corpus, English written texts
BU_Radio | Boston University Radio News | Read speech, annotated from phone to intonation
buckeye | Buckeye Corpus | Spontaneous American English speech, orthographically transcribed, labeled from phone to intonation
Celex | CELEX (Release 2) | Lexical information for English, German, Dutch
celex-old | CELEX (Release 1) | Lexical information for English, German, Dutch (archival)
challenger-raw | Challenger-Raw | Transcript of the space shuttle Challenger commission
challenger-tagged | Challenger-Tagged | Syntactically annotated transcript of Challenger
childes | CHILDES | CHILDES database (n.b. not most current version)
elberfelder-bible | Elberfelder Bible | Text of Revidierte Elberfelder Bible (German)
gutenberg | Project Gutenberg | Classic texts (1993)
helsinki | Helsinki Corpus | Old, Middle, and Early Modern English texts
hist-docs | Historical documents | Text of historical documents (e.g., Magna Carta)
hkust_mandarin_1 | HKUST Mandarin Telephone Transcripts | Orthographically transcribed Mandarin conversational speech
hoosier | Hoosier Mental Lexicon | Lexical database for American English
IndianEng | Indian English | Indian English speech database
kolhapur | Kolhapur Corpus | Indian English written texts
oxford-text-archive | Oxford Text Archive | English texts, many historical
ppcme | Penn-Helsinki Parsed Corpus of Middle English | Syntactically parsed Middle English texts
rnc | Russian National Corpus | Russian texts
rst_discourse_treebank | Rhetorical Structure Theory Discourse Treebank | WSJ articles, annotated with discourse structure
spanish | Madrid Corpus | Annotated Spanish text
susanne | SUSANNE | Geoffrey Sampson's annotated corpus of written English
Treebank3 | Penn Treebank, release 3 | Syntactically annotated written and spoken English texts
Zalizniak | Zalizniak Dictionary | Russian-English dictionary

Software

A variety of software for the analysis of spoken and written texts is available in the department's research labs. This includes Praat, a program for speech analysis and synthesis; a variety of part-of-speech taggers, morphological analyzers, parsers, and programs for the statistical analysis of written texts; and site-licensed packages for large-scale modeling, visualization, and numerical analysis, such as Mathematica, Matlab, SAS, and SPSS. The use of these computational tools in student work is strongly encouraged at Northwestern.

In addition to these externally developed packages, several software applications for audio processing and experimental design have been developed at Northwestern. For more information, please see Chun Chan's website.


Useful links

Institutional support

NU Office for Research

Office for the Protection of Research Subjects, Institutional Review Board

NU Information Technology

Social Sciences Computing Cluster

NU Academic Technologies

Humanities and Social Science Research Guide to Technical Resources

Some online databases

MRC psycholinguistic database

Neighborhood density

Nonword database
