Le Nouveau Corpus d'Amsterdam (NCA)


The original corpus: The Amsterdam Corpus of Old French Literary Texts was compiled at the beginning of the 1980s by a group of scholars directed by Anthonij Dees and resulted in the Atlas des formes linguistiques des textes littéraires de l'ancien français (1987). The electronic version of the texts was provided by Piet van Reenen (Free University of Amsterdam). It contains about 200 different texts, written between the beginning of the 11th and the end of the 14th century, some of them in several versions, which adds up to a total of almost 300 text samples with more than three million words (tokens).
These word forms were manually annotated by Dees' team with a set of 225 numeric tags encoding part of speech and other morphological categories (e.g. "566" for verb, future tense, 3rd person, plural). Some of the texts are electronic versions of existing editions (e.g. the Miracles de Notre Dame de Chartres by Jean le Marchant, edited by P. Kunstmann, Chartres/Ottawa, 1973), others are transcriptions of manuscripts made especially for this corpus. The original texts were not lemmatized. They are nevertheless a precious resource which enabled us to extract a lexicon of more than 130,000 Old French inflected forms and to train the part-of-speech tagger.

"Le Nouveau Corpus d'Amsterdam" (NCA): The new version of the corpus edited (revised, lemmatized, XML-formatted) by Pierre Kunstmann and Achim Stein was presented at the Lauterbad Workshop in February 2006 (see Kunstmann/Stein 2007 below). The corpus, the lexical resources used for the annotation and the documentation are available free of charge for non-commercial, non-profit research purposes for registered users who have sent the signed license agreement (PDF) to this address: Prof. Dr. Achim Stein, Institut für Linguistik/Romanistik, Universität Stuttgart, Keplerstraße 17, 70174 Stuttgart, Germany.

History:

  • 31.05.2018: v3 is also provided for TXM corpus software
  • 19.03.2011: v3 installed, with updated bibliography
  • 22.03.2010: v2 is also provided for TIGERSearch
  • 2008: v2 for download, queries online (with TWIC and TWICweb)
  • 2006: v1 for download

Please cite the corpus versions as indicated below. If you refer specifically to the bibliographical information, please cite:

[Glessgen/Vachon:2010] Gleßgen, Martin-Dietrich & Vachon, Claire (2010): Répertoire bibliographique du Nouveau Corpus d'Amsterdam, établi par Anthonij Dees et Piet Van Reenen (Amsterdam 1987), revu et élargi par M.-D.G. et C.V., 3rd ed., Stuttgart: Institut für Linguistik/Romanistik.

For more information, see:

[Stein:2010] Stein, Achim (2010): Outils et méthodes pour l'annotation des textes médiévaux. (Slides of a talk given at the Sorbonne, Paris, March 2010). [PDF]

[Glessgen/Vachon:2011] Gleßgen, Martin-Dietrich & Vachon, Claire (to appear): "L'étude philologique et scriptologique du Nouveau Corpus d'Amsterdam" - Casanova, Emili / Calvo, Cesáreo (éds.): Actes du XXVI CILPR, València 6-11 septembre 2010, Berlin: De Gruyter. [PDF]

[Glessgen/Gouvert:2007] Gleßgen, Martin-Dietrich & Gouvert, Xavier (2007): "La base textuelle du Nouveau Corpus d'Amsterdam: ancrage diasystématique et évaluation philologique" - Kunstmann, Pierre & Stein, Achim (ed.): Le Nouveau Corpus d'Amsterdam. Actes de l'atelier de Lauterbad, 23-26 février 2006, Stuttgart: Steiner, 51-84.

[Kunstmann/Stein:2007a] Kunstmann, Pierre & Stein, Achim (2007): "Le Nouveau Corpus d'Amsterdam" - Kunstmann, Pierre & Stein, Achim (ed.): Le Nouveau Corpus d'Amsterdam. Actes de l'atelier de Lauterbad, 23-26 février 2006, Stuttgart: Steiner, 9-27 (ISBN 978-3-515-08997-5). [Introduction to the NCA: PDF]


For registered NCA users

Most of the links in the following section will require a user name and password. Access is free, but requires a license agreement (PDF).


Version 3.0 (2011-2018)

1. TWIC Online Search for the NCA:

Open TWIC in a new window.

2. Use the NCA with TXM for local installation

Use the TXM corpus software developed at the ENS de Lyon. It provides a well-documented graphical search interface for the powerful CQP (Corpus Workbench) query processor. Among many other functions, it also lets you easily create subcorpora by selecting individual texts according to the bibliographical information.

  • Download and install the TXM corpus software by following the instructions given on the Textométrie web site. TXM comes with a ‘normal’ installer for different systems (Windows, Mac, etc.).
  • Download the zipped archive NCA for TXM (about 350 MB). You don’t need to unzip the archive!
  • In TXM: Menu File - Load
  • In the ‘Open file’ window: select the zipped archive file (the one you downloaded) and open it.
  • TXM will unpack and install the corpus.
    • This may take a couple of minutes and requires about 3.5 GB of space on your disk.
    • TXM will put the corpus in the ‘corpora’ folder under the ‘TXM’ folder it created (by default) in your home folder.
    • There (probably in corpora/nca3/txm/NCA3) you will also find (more or less) readable XML files for each text. You may re-use them, but use copies and don’t modify TXM’s subfolders.
  • If everything works fine in TXM, you may delete the downloaded zipped archive file.
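The advice above about working on copies of the XML files can be followed with a short script. The following Python sketch copies the per-text XML files out of TXM's corpus folder into a separate working folder, leaving the originals untouched. The function name copy_xml_texts is hypothetical, and the source path in the usage note is only the probable default location mentioned above; adjust both paths to your installation.

```python
from pathlib import Path
import shutil


def copy_xml_texts(source_dir: Path, target_dir: Path) -> list[Path]:
    """Copy every *.xml file from source_dir into target_dir and return the copies."""
    if not source_dir.is_dir():
        raise FileNotFoundError(f"No such folder: {source_dir}")
    target_dir.mkdir(parents=True, exist_ok=True)
    copies = []
    for xml_file in sorted(source_dir.glob("*.xml")):
        destination = target_dir / xml_file.name
        # copy2 only reads the source, so TXM's own files are not modified
        shutil.copy2(xml_file, destination)
        copies.append(destination)
    return copies
```

Assuming the default location, a call could look like copy_xml_texts(Path.home() / "TXM" / "corpora" / "nca3" / "txm" / "NCA3", Path.home() / "nca3-xml-copies"), where the target folder name is an arbitrary choice.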

3. Use the NCA with TIGERSearch for local installation

TIGERSearch is no longer supported by IMS Stuttgart, but it is still freely available and some people still seem to work on the sources. It is a great query tool: it provides a graphical user interface and runs on any system that provides Java or a Java runtime environment (Windows, Mac OS X, Linux and other Unix systems). On the downside, it may be more difficult to install on some systems, and since it was developed for syntactically annotated corpora, queries for corpora that are ‘only’ annotated at word level (part of speech, lemmas) are a bit less straightforward than in other programmes. If you don’t already know TIGERSearch, I recommend trying TXM first.

For TIGERSearch, follow these steps (you may need to adapt some of them to your specific system):

  • Download TIGERSearch from the TIGERSearch home page (IMS, Stuttgart) (the link is at the bottom of the page)
  • Unpack the archive and move the complete folder (as of 2018 ‘TIGERSearchTools’) somewhere on your disk.
    • Some sample corpora are pre-installed in the subfolder ‘CorporaDir’.
  • Use one of the batch files (Windows) / shell scripts (Mac) to start TIGERSearch
  • Download the NCA for TIGERSearch (zip archive): nca3-for-tiger.zip (ca 41MB)
    • Unpack the downloaded zip file: it contains a folder "NCA3"
    • Move this folder into the "CorporaDir" folder (mentioned above).
  • You will find a quick start guide for using TIGERSearch with the NCA on my homepage, in the section Resources (or use this direct link). TIGERSearch includes a help function and a PDF manual (see help section III for a general description of the query language).
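The two corpus installation steps above (unpack the archive, then move the "NCA3" folder into "CorporaDir") can also be scripted. This is a minimal Python sketch, not part of the TIGERSearch distribution: the function name install_corpus is hypothetical, and it assumes, as stated above, that the archive contains a top-level folder "NCA3". Extracting directly into "CorporaDir" has the same effect as unpacking elsewhere and moving the folder.

```python
import zipfile
from pathlib import Path


def install_corpus(archive: Path, corpora_dir: Path, folder_name: str = "NCA3") -> Path:
    """Unpack archive into corpora_dir and return the path of the corpus folder."""
    if not corpora_dir.is_dir():
        raise FileNotFoundError(f"CorporaDir not found: {corpora_dir}")
    with zipfile.ZipFile(archive) as zf:
        # extractall keeps the directory structure stored in the archive
        zf.extractall(corpora_dir)
    corpus_folder = corpora_dir / folder_name
    if not corpus_folder.is_dir():
        # Assumption check: the archive is expected to contain this folder
        raise RuntimeError(f"Archive did not contain a folder {folder_name!r}")
    return corpus_folder
```

A call might look like install_corpus(Path("nca3-for-tiger.zip"), Path("TIGERSearchTools") / "CorporaDir"); both paths depend on where you saved the download and where you put the TIGERSearch folder.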

4. Download the NCA corpus with TWIC for local installation

Download this ZIP Archive to install TWIC on your computer (Windows, Linux, Mac OS X). It includes the TWIC Perl programme, documentation for the different operating systems, and a sample corpus taken from the NCA. Open the ZIP Archive and read the included PDF document "TWIC installation".

Once TWIC is installed, you can replace the sample corpus with the entire NCA corpus (download nca3.xml.gz, 2.5 MB). You can also install and search your own corpora: read the section about the configuration file. Even if you don’t intend to use TWIC, the XML file containing the complete corpus may be useful for other purposes.

5. Documentation and Bibliography for this version

Please follow the link for the online query above. On the query form, click on "corpus information window", where you will find links to the bibliography in various formats.

Please refer to this version as:

Stein, Achim et al. (ed.): Nouveau Corpus d'Amsterdam. Corpus informatique de textes littéraires d'ancien français (ca 1150-1350), établi par Anthonij Dees (Amsterdam 1987), remanié par Achim Stein, Pierre Kunstmann et Martin-D. Gleßgen, Stuttgart: Institut für Linguistik/Romanistik, version 3, 2011.

Changes:

  • The bibliography has been revised considerably (by the Zurich group: Martin-D. Gleßgen and Claire Vachon).

Version 2.0 (2008, updated 2010)

(Information provided for older versions is not updated.)

1. TWIC Online Search for the NCA:

Open TWIC in a new window.

2. Download the NCA corpus with TWIC for local installation

Download this ZIP Archive to install TWIC on your computer (Windows, Linux, Mac OS X). It includes the TWIC Perl programme, documentation for the different operating systems, and a sample corpus taken from the NCA. Open the ZIP Archive and read the included PDF document "TWIC installation".

Once TWIC is installed, you can replace the sample corpus with the entire NCA corpus (download nca2.xml.gz, 2.5 MB). You can also install and search your own corpora: read the section about the configuration file.
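The gzip-compressed XML file can also be inspected without TWIC. The following Python sketch streams through such a file and counts how often each XML element occurs, which gives a first impression of the markup before writing more specific queries. It makes no assumption about the element names used in the NCA; the function name count_elements is hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def count_elements(path: Path) -> Counter:
    """Stream through a gzip-compressed XML file and count elements by tag name."""
    counts = Counter()
    with gzip.open(path, "rb") as handle:
        # iterparse reads the file incrementally instead of loading it at once
        for _event, element in ET.iterparse(handle, events=("end",)):
            counts[element.tag] += 1
            element.clear()  # drop text and children to limit memory use
    return counts
```

For example, count_elements(Path("nca2.xml.gz")).most_common(10) lists the ten most frequent element names in the downloaded file.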

3. Download the NCA corpus with TIGERSearch for local installation

Note that TIGERSearch is probably easier to install than TWIC (since it does not require the installation of a web server). It provides a graphical user interface and is available for Windows, Mac OS X, Linux and Solaris. Please follow these steps:

  1. Download TIGERSearch from the TIGERSearch Download page (IMS, Stuttgart)
  2. Download one of the corpus files in TIGER-XML format:
  3. Follow the instructions in the document The NCA for TIGERSearch (PDF)

4. Documentation and Bibliography for this version

Please refer to this version as:

Stein, Achim et al. (ed.): Nouveau Corpus d'Amsterdam. Corpus informatique de textes littéraires d'ancien français (ca 1150-1350), établi par Anthonij Dees (Amsterdam 1987), remanié par Achim Stein, Pierre Kunstmann et Martin-D. Gleßgen, Stuttgart: Institut für Linguistik/Romanistik, version 2, 2008.

Changes:

  • The text La passion des jongleurs (id=jong) was updated: in version 1, the last word of each line was missing. (Thanks to Yves-Charles Morin for reporting this error.)
  • The bibliography has been revised considerably (by the Zurich group: Martin-D. Gleßgen and his staff). The first entry of the bibliography (see links above) is a "comment entry" which briefly explains the meaning of the descriptors. These descriptors are values of the XML element "subcorpus" (using TWIC, you can therefore restrict your search to texts corresponding to these values, e.g. date ranges, regions, quality of the manuscript etc.).

Note that the bibliography is still a work in progress. Updates will be published here.


Version 1.0 (2006)

Download the original distribution of the corpus:

The first version (1.0) of the corpus was presented on a CD-ROM at the Lauterbad Workshop in February 2006. To reproduce it:

  1. create a directory on your local disk drive, e.g. "nca"
  2. download the files 00readme.txt, 00license.txt to "nca" (see below)
  3. download the following zip archives to "nca"
    • twic.zip, 27MB, (new corpus, TWIC search tool)
    • sofa.zip, 10MB, (documentation, original corpus, frequency lists...)
    • perl.zip, 13MB, (ActiveState Perl for Windows, required if you use TWIC, also available at www.activestate.com)
    • xaira.zip, 382MB, (corpus formatted for Xaira, Xaira for Windows, not required if you use TWIC)
    • tagger.zip, 4MB, (TreeTagger, parameters for Old French)
  4. unpack the archives (preserve the directory structure)
  5. follow the Installation Guide in 00readme.txt

Browse the documentation online:

The SOFA directory (Sources et Outils pour le français ancien): documentation, material and resources for the Nouveau Corpus d'Amsterdam.

Quote this version as:

Stein, Achim et al. (ed.): Nouveau Corpus d'Amsterdam. Corpus informatique de textes littéraires d'ancien français (ca 1150-1350), établi par Anthonij Dees (Amsterdam 1987), remanié par Achim Stein, Pierre Kunstmann et Martin-D. Gleßgen, Stuttgart: Institut für Linguistik/Romanistik, 2006.


Les chartes de l'Aube

  • Printed version: Pieter van Reenen, avec le concours de Evert Wattel et Margôt van Mulken: Champagne 1270-1300, Chartes en langue française conservées aux Archives de l'Aube, Orléans: Paradigme 2006.
  • Electronic version, provided by Piet van Reenen with the permission of the publisher: Zip archive, 145k