
Dear users,

We present the first online parallel multilingual corpus that brings together Basque, Polish and English. The tool is still an experimental version; however, after successful preliminary tests on a trial set of linguistic material, we decided to make it available online. The corpus currently runs on a simplified search engine, but even this makes it possible to obtain interesting results.

A little bit of theory

A corpus is a structured and possibly large set of “texts which are used for linguistic investigations, such as determining the frequency of occurrences of words, syntactic constructions, the contexts wherein given words appear. A more recent application of the corpora consists in teaching the computers during the processing of natural languages.” [1]. Moreover, “the data extracted from corpus [...] is used to compile dictionaries, thesauri, glossaries and is useful during the teaching of vocabulary of foreign languages, [...] the tools of vocabulary extraction from corpus [...] allow the use of the data during the translation, both carried out by a translator (Computer-Aided Translation/CAT) and automatic (Machine Translation/MT) [...]” [2].

How does it work?

For the moment, the corpus has only a very simple search engine: in the first column of the form, choose the language and enter the first letters of the word you are looking for (without the additional characters * or % at the end). The results will include all records containing words that begin with the entered sequence of letters. For example, entering white with English selected returns the records that contain the words white and whites, as well as words such as whiten, whitened, whited and whiteness. Note that the more initial letters you enter, the narrower the search results will be.
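The prefix matching described above can be illustrated with a minimal Python sketch. It is not the corpus's actual implementation: the record layout, the sample sentences and the function names `has_prefix` and `search` are assumptions made for this example only.

```python
import re

# Invented sample records (not taken from the corpus): language code -> text.
RECORDS = [
    {"en": "She wore a white dress."},
    {"en": "The walls were whitened last year."},
    {"en": "He caught a whiting."},
    {"en": "The night was dark."},
]

def has_prefix(text, prefix):
    """Return True if any word in `text` starts with `prefix` (case-insensitive)."""
    prefix = prefix.lower()
    return any(word.startswith(prefix) for word in re.findall(r"\w+", text.lower()))

def search(records, lang, prefix):
    """Return the records whose text in language `lang` has a word starting with `prefix`."""
    return [r for r in records if has_prefix(r.get(lang, ""), prefix)]

if __name__ == "__main__":
    for record in search(RECORDS, "en", "white"):
        print(record["en"])
```

In this sketch a shorter prefix such as whit matches more records than white, which mirrors the remark that entering more initial letters narrows the results.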

The second language option restricts the results to records that contain the desired sequences in two fields (languages) of the database at the same time (logical operator AND). For example, you can search only for records in which the English text contains the word woman and its Basque equivalent contains the word emakume. This option also lets you search for records that contain two different sequences in the same field (the same language). For example, after choosing English twice and entering woman in the first field and man in the second, you will obtain only the quotations that contain both words.
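The AND combination of two search fields can be sketched in the same spirit. This is a hedged illustration under assumed names (`has_prefix`, `search_and`) and invented sample data; the real database schema and query logic are not documented here.

```python
import re

# Invented sample records (not taken from the corpus): language code -> text.
RECORDS = [
    {"en": "The woman spoke.", "eu": "emakumea"},
    {"en": "The woman and the man left.", "eu": "emakumea eta gizona"},
    {"en": "The man slept.", "eu": "gizona"},
]

def has_prefix(text, prefix):
    """Return True if any word in `text` starts with `prefix` (case-insensitive)."""
    prefix = prefix.lower()
    return any(word.startswith(prefix) for word in re.findall(r"\w+", text.lower()))

def search_and(records, first, second):
    """Keep records that satisfy both (language, prefix) conditions (logical AND)."""
    (lang1, prefix1), (lang2, prefix2) = first, second
    return [
        r for r in records
        if has_prefix(r.get(lang1, ""), prefix1) and has_prefix(r.get(lang2, ""), prefix2)
    ]

if __name__ == "__main__":
    # Two different languages: English "woman" AND Basque "emakume".
    print(search_and(RECORDS, ("en", "woman"), ("eu", "emakume")))
    # The same language twice: English "woman" AND English "man".
    print(search_and(RECORDS, ("en", "woman"), ("en", "man")))
```

Passing the same language code in both conditions corresponds to choosing one language twice in the form.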

To make the results page easier to read, you can use the show the results in option, which lets you select only the languages you are interested in. Quotations in the other languages will be hidden.

The last option sets the number of records shown per page (show ... results per page).
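The two display options, hiding languages and limiting the number of records per page, can be sketched as follows. The helper names `select_languages` and `paginate`, the 1-based page numbering and the sample data are assumptions for illustration, not a description of how the site is built.

```python
# Invented sample results (not taken from the corpus): language code -> text.
RESULTS = [
    {"en": "woman", "pl": "kobieta", "eu": "emakumea"},
    {"en": "man", "pl": "mężczyzna", "eu": "gizona"},
    {"en": "white", "pl": "biały", "eu": "zuria"},
]

def select_languages(record, shown):
    """Keep only the languages listed in `shown`; the others are hidden."""
    return {lang: text for lang, text in record.items() if lang in shown}

def paginate(results, per_page, page):
    """Return the records for a 1-based `page`, with `per_page` records per page."""
    if per_page < 1 or page < 1:
        raise ValueError("per_page and page must be positive")
    start = (page - 1) * per_page
    return results[start:start + per_page]

if __name__ == "__main__":
    visible = [select_languages(r, {"en", "eu"}) for r in RESULTS]
    for number in (1, 2):
        print(number, paginate(visible, per_page=2, page=number))
```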

The search preferences are saved, so there is no need to set them again for each new search. They are lost only when you leave the page. The language preferences, however, are stored in a cookie, which is why they are restored automatically every time you return to the corpus page.

We are planning:
  • to extend the text base until it reaches 1 million words in each language,
  • to add new types of texts: dramas, philosophical treatises, scientific works and fragments of the Bible,
  • to equip the corpus with lemmatization modules, initially for Polish and Basque, then also for other languages,
  • to add the German language,
  • to extend the search capabilities.

_________

SOURCES:

[1] Wikipedia.
[2] Lewandowska-Tomaszczyk, Barbara (ed.), Podstawy językoznawstwa korpusowego, Wydawnictwo Uniwersytetu Łódzkiego, Łódź 2005.