An English-Inuktitut Parallel Corpus

www.inuktitutcomputing.ca


The Legislative Assembly of Nunavut publishes its Hansard in English and Inuktitut. Since the two versions are direct translations of each other, they are an excellent candidate for an Inuktitut-English parallel corpus. The Assembly graciously gave us access to this material for research purposes. Note that these copies have no official status as a Hansard; for the official record, refer to http://www.assembly.nu.ca.

Version 2.0 of this match was done in January 2008. It contains the proceedings of all session-days from April 1, 1999 to November 8, 2007 (excluding 2003); November 8, 2007 was the last day for which proceedings were available at that time. This version is the result of merging the aligned texts of Version 1.1 for 1999-2002 with those newly aligned for 2004-2007. The 2004-2007 texts were prepared from text extracted from the pairs of PDF documents, Inuktitut and English, found on the site of the Legislative Assembly of Nunavut. The alignment was done with Moore's programs, which are based on the Gale and Church algorithm.

Download Version 2.0 (gzipped 21 MB)

Version 1.1 of this match was done on June 3, 2003. It is based on 155 session-days of Hansard that we received in Word format from the Nunavut Legislative Assembly. The alignment was done as for Version 1.0 with the following improvements:

Download Version 1.1 (gzipped 14 MB)

Version 1.0 of this match was done in April 2003. It is based on 155 session-days of the Hansard that we received in Word format from the Nunavut Legislative Assembly. These were sentence-aligned using a modification of the Gale-Church algorithm, constrained within a match based on a number of lexical tokens. Details of this process are described in a paper entitled "Aligning and Using an English-Inuktitut Parallel Corpus", presented at the HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond.

The PowerPoint presentation is also available: HLT PowerPoint presentation.

Download Version 1.0 (gzipped 14 MB)
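
For readers unfamiliar with length-based sentence alignment, the following is a minimal Python sketch of the plain Gale-Church idea: sentences are grouped by dynamic programming so that the character lengths on the two sides stay close. It is not the modified, lexically constrained procedure used for this corpus, nor Moore's programs. The prior probabilities and the constants C and S2 are the values commonly quoted from Gale and Church (1993) and are assumptions here; they have not been tuned for Inuktitut. The function names match_cost and align are hypothetical.

```python
import math

# Prior probability of each alignment type (source count, target count).
# Values commonly quoted from Gale and Church (1993); assumptions here.
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.0099, (0, 1): 0.0099,
    (2, 1): 0.089, (1, 2): 0.089,
    (2, 2): 0.011,
}
C = 1.0   # assumed expected target characters per source character
S2 = 6.8  # assumed variance of that ratio, per character


def match_cost(len_src, len_tgt, prior):
    """Negative log-probability of pairing spans with these character lengths."""
    mean = (len_src + len_tgt / C) / 2.0
    if mean == 0:
        return -math.log(prior)
    delta = (len_src * C - len_tgt) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal at |delta|.
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    # The floor avoids log(0) when the tail probability underflows.
    return -math.log(max(tail * prior, 1e-300))


def align(src, tgt):
    """Return a list of (source_group, target_group) sentence groupings."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), prior in PRIORS.items():
                if i < di or j < dj or best[i - di][j - dj] == inf:
                    continue
                len_src = sum(len(s) for s in src[i - di:i])
                len_tgt = sum(len(t) for t in tgt[j - dj:j])
                total = best[i - di][j - dj] + match_cost(len_src, len_tgt, prior)
                if total < best[i][j]:
                    best[i][j] = total
                    back[i][j] = (di, dj)
    # Walk the back-pointers from the end to recover the chosen groupings.
    groups = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        groups.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    groups.reverse()
    return groups
```

Calling align(english_sentences, inuktitut_sentences) on two lists of sentence strings returns the groupings in document order. Because only lengths are compared, a sketch like this can drift when whole passages are missing on one side, which is one reason to constrain the search with lexical anchors as described above.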

Searching for a word in the Nunavut Hansard

As a demonstration of how the Inuktitut-English parallel corpus can be used, we have developed a tool that lets you search the parallel corpus for a phrase, a word or part of a word, in Inuktitut or English, and that shows the corresponding Inuktitut and English sentences that contain it. This search tool can be reached through the following link: Searching the Nunavut Hansard.
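
The kind of lookup such a tool performs can be sketched in a few lines of Python. This is only an illustration, not the code behind the search tool: it assumes the aligned corpus has already been loaded as a list of (Inuktitut, English) sentence pairs, and the function name search_pairs and the sample pairs are hypothetical and not taken from the corpus.

```python
def search_pairs(pairs, query):
    """Return every aligned pair whose Inuktitut or English side contains query.

    Matching is a case-insensitive substring test, so a part of a word
    matches as well as a whole word or phrase.
    """
    needle = query.casefold()
    return [
        (inuktitut, english)
        for inuktitut, english in pairs
        if needle in inuktitut.casefold() or needle in english.casefold()
    ]


# Illustrative pairs only (romanized Inuktitut); not drawn from the Hansard.
SAMPLE_PAIRS = [
    ("Qujannamiik.", "Thank you."),
    ("Nunavut", "Nunavut"),
]

if __name__ == "__main__":
    for inuktitut, english in search_pairs(SAMPLE_PAIRS, "thank"):
        print(inuktitut, "|", english)
```

A linear scan like this is adequate for a demonstration; for a corpus of this size an index over words or character n-grams would likely respond faster.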