NRC Inuktitut Search Engine

www.inuktitutcomputing.ca       contact


This original search engine for the Inuktitut language retrieves Inuktitut text from Inuktitut web pages, whatever font and character set those pages use to display syllabic characters. For the user's convenience, the search text can be entered in the most common syllabic fonts (Nunacom, Prosyl and AiPaiNunavik), in Unicode syllabics, or in the Roman alphabet. Wildcards are supported, as are the Boolean operators AND, OR and NOT.

For more information about how to use the search engine, please visit this page.

To access the NRC Inuktitut Search Engine, please go to this address.

Why an Inuktitut Search Engine?

BECAUSE the Unicode Unified Canadian Aboriginal Syllabics block is not recognized everywhere

Several of the common search engines do not treat the Unicode characters of the Unified Canadian Aboriginal Syllabics block as word characters. If you search Google or Yahoo! for a word written in Unicode syllabics, for example, you will get no results.
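The effect of not treating syllabics as word characters can be illustrated with a small sketch. It assumes a hypothetical indexer that accepts only ASCII letters and digits in a word; how the commercial engines actually tokenize text is not known here. The syllabic spelling of "ilinniarvik" is our own transcription of the romanized word.

```python
import re

# "ilinniarvik" (school) in Unicode syllabics; this spelling is our own
# transcription of the romanized word and is an assumption of this sketch.
word = "ᐃᓕᓐᓂᐊᕐᕕᒃ"

# Hypothetical indexer: only ASCII letters and digits count as word characters.
ascii_tokens = re.findall(r"[A-Za-z0-9]+", word)

# Unicode-aware tokenization: in Python 3, \w matches any Unicode letter,
# including Canadian syllabics.
unicode_tokens = re.findall(r"\w+", word)

print(ascii_tokens)    # nothing to index
print(unicode_tokens)  # the whole word as one token
```

With the ASCII-only rule the word produces no tokens at all, so there is nothing to index or retrieve; the Unicode-aware rule keeps it as a single token.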

BECAUSE there are several different encodings

Originally, Inuktitut text was (and still often is) displayed with 7-bit (and 8-bit) fonts like ProSyl and Nunacom. Those fonts use the same character set, namely the ASCII character set, as the fonts everyone is used to, like Courier, Times, etc., but instead of displaying the letters we know, the codes of the character set are associated with Inuktitut glyphs. For example, the code 70 in the font Nunacom represents the syllabic character ᕕ (vi); in the "English" fonts, it represents the Roman letter 'F'. This means that an Inuktitut word like ᐃᓕᓐᓂᐊᕐᕕᒃ (ilinniarvik - school) in Nunacom is in fact indexed by the common search engines as its "English"-code equivalent, wo8ix3F4.
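The code-to-glyph association can be sketched as a lookup table. Only the entry for code 70 ('F' → vi) is stated in the text; the other entries below are inferred by aligning wo8ix3F4 with "ilinniarvik" and should be read as assumptions, as should the table and function names. This is a partial table for illustration, not the full Nunacom mapping.

```python
# Partial Nunacom code-to-syllable table (hypothetical name).
# Only 'F' -> vi comes from the text; the rest is inferred from the
# alignment wo8ix3F4 <-> i-li-n-ni-a-r-vi-k.
NUNACOM_PARTIAL = {
    "w": ("i", "ᐃ"),
    "o": ("li", "ᓕ"),
    "8": ("n", "ᓐ"),
    "i": ("ni", "ᓂ"),
    "x": ("a", "ᐊ"),
    "3": ("r", "ᕐ"),
    "F": ("vi", "ᕕ"),
    "4": ("k", "ᒃ"),
}


def decode_nunacom(codes):
    """Return (romanization, syllabics) for a string of Nunacom codes."""
    roman, syllabics = [], []
    for ch in codes:
        if ch not in NUNACOM_PARTIAL:
            raise KeyError(f"no mapping for code {ord(ch)} ({ch!r}) in this partial table")
        r, s = NUNACOM_PARTIAL[ch]
        roman.append(r)
        syllabics.append(s)
    return "".join(roman), "".join(syllabics)


roman, syllabics = decode_nunacom("wo8ix3F4")
print(ord("F"), roman, syllabics)
```

A generic search engine never applies such a table: it sees only the ASCII string wo8ix3F4.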

Each so-called 'legacy' font has its own code-to-glyph association table, and although some fonts have similarities, there are also big differences. A number of Inuktitut words have the same code sequence in different fonts, but most do not. For example, one and the same word is w6]vNw]/E/z5 in Nunacom, w6>vNw>/E/z5 in Prosyl, w6√Nw÷E/z5 in Naamajut, Žñ›¶ŽÎäÍö” in Aujaq2, w6Ïâ÷E/z5 in AiPaiNunavik, and so on. To search for that word with the existing search engines, one would have to search for all of those code sequences and more, provided one knows of them in the first place.
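To see how two legacy encodings of the same word diverge, the Nunacom and Prosyl sequences quoted above can be compared position by position. This is only an illustrative sketch; the helper name is hypothetical.

```python
from itertools import zip_longest

# Code sequences for the same word, as quoted in the text.
nunacom = "w6]vNw]/E/z5"
prosyl = "w6>vNw>/E/z5"


def code_differences(a, b):
    """List (position, code_in_a, code_in_b) wherever the sequences differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip_longest(a, b)) if x != y]


diffs = code_differences(nunacom, prosyl)
print(diffs)  # [(2, ']', '>'), (6, ']', '>')]
```

Here the two fonts differ only in the code at two positions, yet that is enough for an exact-match search on one sequence to miss pages written with the other font.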

BECAUSE certain codes are word delimiters

While words like the Nunacom word in the example above have code sequences that the common search engines can index, a great many Inuktitut words have code sequences containing codes that those search engines treat not as valid word characters but as word delimiters. Consequently, such words are not indexed and cannot be retrieved with those search engines. Moreover, because some codes are treated as delimiters, a code sequence containing them is interpreted not as a single word but as a set of words. For example, a Google search at www.google.ca for the Nunacom Inuktitut word with the code sequence 'w6]vNw]/E/z5' returns 18300 hits; at www.google.com, Google returns 7470 hits. Because ']' and '/' are delimiters to Google, the pages returned are those that contain all the "parts" w6, vNw, E, z5, or their upper-/lower-case equivalents, as in the following excerpts:

... MFO@W&+VNW>B^5N.82PSE5M(G(.#XO?'JVS+S`^0++,#/EIULP1 ... R&JGR^OH+_L*0@<";+M4>!$
MLWJ8_.0MK^8.6,S8''W6=8M*I22580 ... R!\>="4'+=8O5,!I5[(H\ZC"]^I1KH8*/G M&N+E]Z5[?,6XEE4< ...

For the same word, the NRC Inuktitut Search Engine returns 1 hit!
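The splitting described in this section can be reproduced with a minimal tokenizer. It assumes that every character other than an ASCII letter or digit acts as a delimiter; this is a simplification for illustration, not Google's actual rule set, and the function name is hypothetical.

```python
import re


def naive_tokens(text):
    """Split on anything that is not an ASCII letter or digit (assumed rule)."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


parts = naive_tokens("w6]vNw]/E/z5")
print(parts)  # ['w6', 'vNw', 'E', 'z5']

# A page "matches" if it contains every part, compared case-insensitively.
query_terms = {p.lower() for p in parts}
print(sorted(query_terms))
```

Since the resulting parts are very short, unrelated pages containing all of them are easy to find, which is consistent with the large hit counts reported above.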

BECAUSE of case-insensitivity

The current search engines were developed with languages in mind (primarily English) that use an alphabet with upper and lower cases, where case does not convey any meaning at the lexeme level. For example, 'sky', 'SKY' and 'Sky' are all the same word. The search is therefore done without regard to case: searching for any of those forms returns pages containing any of them. So, an input like the Nunacom word with the indexable code sequence wo8ix3F4 will return pages that contain not only that sequence, but also WO8IX3F4, Wo8iX3f4, and every other sequence obtained by putting the 'letter' codes in lower or upper case, that is, 32 different sequences. The problem is that although those sequences may correspond to the same word in languages with cases, they do not in Inuktitut written with legacy fonts. In our example, WO8IX3F4 and Wo8iX3f4 in Nunacom each represent a different string of syllabics that has nothing to do with the input query; they are not even Inuktitut words.
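The count of 32 follows from the five letter codes in wo8ix3F4 (w, o, i, x, F), each of which can be upper or lower case, giving 2^5 = 32 sequences; the digits 8, 3 and 4 have no case. A short sketch (function name hypothetical) enumerates them:

```python
from itertools import product


def case_variants(codes):
    """All sequences a case-insensitive engine treats as equal to `codes`."""
    options = [(c.lower(), c.upper()) if c.isalpha() else (c,) for c in codes]
    return {"".join(combo) for combo in product(*options)}


variants = case_variants("wo8ix3F4")
print(len(variants))           # 32
print("WO8IX3F4" in variants)  # True
print("Wo8iX3f4" in variants)  # True
```

In a legacy font each of these 32 sequences displays as a different string of syllabics, so only one of them is the word the user asked for.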