How to Transliterate Russian Text

Transliterating natural-language text means converting it from one writing system to another using a set of predefined mappings between characters or character sequences. These mappings may also be context dependent.
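To make the idea of character and character sequence mappings concrete, here is a minimal Python sketch. It is not how Lingua::Translit is implemented; the function name `transliterate` and the tiny `RULES` table are made up for illustration and cover only a handful of Latin-to-Cyrillic mappings. It applies the rules greedily and prefers the longest matching source sequence, so that "zh" is not mistaken for "z" followed by "h".

```python
def transliterate(text, rules):
    """Apply mapping rules greedily, preferring the longest source sequence."""
    # Try longer source sequences first so that "zh" wins over "z".
    sources = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for src in sources:
            if text.startswith(src, i):
                out.append(rules[src])
                i += len(src)
                break
        else:
            # No rule applies: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Illustrative Latin-to-Cyrillic rules only, not a complete standard.
RULES = {
    "zh": "ж", "yu": "ю", "ya": "я",
    "z": "з", "r": "р", "o": "о", "d": "д", "a": "а", "t": "т", "s": "с",
}

print(transliterate("rozhdayutsya", RULES))
```

Characters without a rule, such as punctuation, pass through unchanged. A real standard additionally defines context-dependent rules, which this sketch leaves out.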

Transliterations have been standardized for a variety of languages and writing systems by national and international organizations such as ISO, DIN and GOST. This way, transliterated natural-language text can easily be exchanged and interpreted by readers who are not familiar with its native alphabet.

This post shows you how to transliterate Russian text and introduces you to Lingua::Translit, an open source software solution that provides a variety of standardized and commonly used transliterations. It uses Russian and its popular, widespread standardized transliterations ISO 9, DIN 1460 and GOST 7.79 as an example.

Lingua::Translit - An Open Source Transliteration Software

As open source software, Lingua::Translit is freely available for download and requires nothing but a standard Perl installation. It should therefore be usable on any platform supported by Perl and is known to run on Linux, FreeBSD and Solaris as well as on all current Microsoft Windows systems and Cygwin. Installation instructions are provided within the distribution - have a look at the "README" file.

Lingua::Translit provides an easy-to-use command line frontend, translit, as well as a library that allows developers to embed its functionality in their own projects. The remainder of this post focuses on the command line application, which is aimed at users. Developers can obtain detailed information on the API from Lingua::Translit's man page. A web frontend is available as well; it lets you experiment with the supported standards and Lingua::Translit's features without installing the software, which is particularly useful for getting a first impression.

Supported Transliteration Standards

To start with, you may want a list of the supported transliteration standards, which translit prints when called with the "-l" switch. The most current list can be found in Lingua::Translit's supported standards document. The following example shows the transliterations supported by an installation of version 0.18:

$ translit -l
Transliterations supported by Lingua::Translit v0.18:

Common CES, not reversible, Czech without diacritics
Common Classical MON, reversible, Classical Mongolian
  Script to Latin
Common DEU, not reversible, German umlauts
Common POL, not reversible, Unaccented Polish
Common RON, not reversible, Romanian without diacritics
Common SLK, not reversible, Slovak without diacritics
Common SLV, not reversible, Slovenian without diacritics
DIN 1460 BUL, reversible, DIN 1460:1982, Cyrillic to
  Latin, Bulgarian
DIN 1460 RUS, reversible, DIN 1460:1982, Cyrillic to
  Latin, Russian
DIN 1460 UKR, reversible, DIN 1460:1982, Cyrillic to
  Latin, Ukrainian
DIN 31634, not reversible, DIN 31634:1982, Greek to Latin
GOST 7.79 RUS, reversible, GOST 7.79:2000, Cyrillic to
  Latin, Russian
GOST 7.79 RUS OLD, not reversible, GOST 7.79:2000, Cyrillic
  to Latin with support for Old Russian (pre 1918), Russian
GOST 7.79 UKR, reversible, GOST 7.79:2000, Cyrillic to
  Latin, Ukrainian
Greeklish, not reversible, Greeklish (Phonetic), Greek to
  Latin
ISO 843, not reversible, ISO 843:1997 TL (Type 1), Greek
  to Latin
ISO 9, reversible, ISO 9:1995, Cyrillic to Latin
Streamlined System BUL, not reversible, The Streamlined
  System: 2006, Cyrillic to Latin, Bulgarian
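If you want to pick out the reversible standards from such a listing in a script, the entries can be parsed with a few lines of Python. This is only a sketch: it assumes the output has exactly the layout printed above (entries of the form "name, reversibility, description", with indented continuation lines), which may differ between versions, and it expects the header line to be removed first. The function name `parse_listing` and the shortened `SAMPLE` are made up for illustration.

```python
def parse_listing(listing):
    """Parse 'name, reversibility, description' entries.

    Indented lines are treated as continuations of the previous entry.
    """
    entries = []
    for line in listing.splitlines():
        if not line.strip():
            continue
        if line[0].isspace() and entries:
            entries[-1] += " " + line.strip()
        else:
            entries.append(line.strip())

    parsed = []
    for entry in entries:
        # Split on the first two commas only; descriptions contain commas too.
        name, reversibility, description = entry.split(", ", 2)
        parsed.append({
            "name": name,
            "reversible": reversibility == "reversible",
            "description": description,
        })
    return parsed


# A few entries copied from the listing above (header line omitted).
SAMPLE = """\
DIN 1460 RUS, reversible, DIN 1460:1982, Cyrillic to
  Latin, Russian
GOST 7.79 RUS, reversible, GOST 7.79:2000, Cyrillic to
  Latin, Russian
GOST 7.79 RUS OLD, not reversible, GOST 7.79:2000, Cyrillic
  to Latin with support for Old Russian (pre 1918), Russian
ISO 9, reversible, ISO 9:1995, Cyrillic to Latin
"""

reversible = [e["name"] for e in parse_listing(SAMPLE) if e["reversible"]]
print(reversible)
```

The names collected this way are the values that can be passed to translit's "-t" switch.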

Transliteration of Russian: "ISO 9"

Assume you have a UTF-8 encoded Russian text, stored in a file "russian.txt" with content like this (taken from the Universal Declaration of Human Rights):

Все люди рождаются свободными и равными в своем достоинстве и правах. Они наделены разумом и совестью и должны поступать в отношении друг друга в духе братства.

These Cyrillic lines may be transliterated to the Latin alphabet according to ISO 9 by translit using the following command, issued either on a Unix terminal or Windows prompt:

$ translit -t "ISO 9" -i russian.txt

Output:

Vse lûdi roždaûtsâ svobodnymi i ravnymi v svoem dostoinstve i pravah. Oni nadeleny razumom i sovestʹû i dolžny postupatʹ v otnošenii drug druga v duhe bratstva.

The "-t" switch is used to choose a transliteration standard, while "-i" instructs translit to read the file given as its argument and use it as input for the transliteration. Other important switches include "-o", which writes the transliterated text to a file rather than printing it on the terminal, "-v", which enables verbose status messages, and "-r", which enables reverse transliteration if the chosen standard supports it. For a complete description of the available switches, have a look at translit's man page. The following example shows how to use these switches:

$ translit -t "ISO 9" -i russian.txt -o iso9.txt -v
Reading input from russian.txt...
Writing output to iso9.txt...
Transliterating according to ISO 9...
$ translit -t "ISO 9" -i iso9.txt -r -v
Reading input from iso9.txt...
Writing output to STDOUT...
Transliterating according to ISO 9 (reverse)...

Output:

Все люди рождаются свободными и равными в своем достоинстве и правах. Они наделены разумом и совестью и должны поступать в отношении друг друга в духе братства.

As you can see, a transliteration that is marked as reversible can losslessly convert transliterated text back to its original representation by evaluating its set of transliteration rules in the reverse direction.
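Why the round trip works for ISO 9 can be illustrated with a short Python sketch. ISO 9 assigns exactly one Latin character to each Cyrillic letter, so the table can simply be inverted. The code below is not Lingua::Translit's implementation: the `PAIRS` table is only a subset read off the sample text and its ISO 9 output above, and the names `to_latin` and `to_cyrillic` are made up for illustration.

```python
# Subset of ISO 9 covering only the letters of the sample text
# (assumption: pairs are read off the sample input and output above).
PAIRS = [
    ("а", "a"), ("б", "b"), ("в", "v"), ("г", "g"), ("д", "d"), ("е", "e"),
    ("ж", "\u017e"),  # z with caron
    ("з", "z"), ("и", "i"), ("л", "l"), ("м", "m"), ("н", "n"), ("о", "o"),
    ("п", "p"), ("р", "r"), ("с", "s"), ("т", "t"), ("у", "u"), ("х", "h"),
    ("ш", "\u0161"),  # s with caron
    ("ы", "y"),
    ("ь", "\u02b9"),  # modifier letter prime
    ("ю", "\u00fb"),  # u with circumflex
    ("я", "\u00e2"),  # a with circumflex
]

FORWARD = {}
REVERSE = {}
for cyr, lat in PAIRS:
    FORWARD[cyr] = lat
    REVERSE[lat] = cyr
    if lat.upper() != lat:
        # Cased letters get an upper-case pair as well.
        FORWARD[cyr.upper()] = lat.upper()
        REVERSE[lat.upper()] = cyr.upper()
    else:
        # Caseless signs such as the prime have no upper-case form.
        FORWARD[cyr.upper()] = lat

TO_LATIN = str.maketrans(FORWARD)
TO_CYRILLIC = str.maketrans(REVERSE)


def to_latin(text):
    return text.translate(TO_LATIN)


def to_cyrillic(text):
    return text.translate(TO_CYRILLIC)


original = "Все люди рождаются свободными и равными в своем достоинстве и правах."
latin = to_latin(original)
print(latin)
print(to_cyrillic(latin) == original)
```

Because every mapping is one character to one character, the transliterated text has the same length as the original. The sketch cannot restore an upper-case soft sign, since both cases map to the same prime character here; the sample text does not contain one.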

Transliteration of Russian: "DIN 1460 RUS"

Let's convert the text using "DIN 1460 RUS" as well, which represents the Russian subset of the transliteration rules provided by DIN 1460:

$ translit -t "DIN 1460 RUS" -i russian.txt

Output:

Vse ljudi roždajutsja svobodnymi i ravnymi v svoem dostoinstve i pravach. Oni nadeleny razumom i sovest'ju i dolžny postupat' v otnošenii drug druga v duche bratstva.

DIN 1460 also provides language-specific transliteration rules for other languages written in the Cyrillic alphabet, including Bulgarian ("DIN 1460 BUL") and Ukrainian ("DIN 1460 UKR").

Transliteration of Russian: "GOST 7.79 RUS"

GOST 7.79 defines transliteration rules for a wide range of languages written in the Cyrillic alphabet, including Russian, Ukrainian and Bulgarian. In contrast to ISO 9 and DIN 1460, its mappings do not require any diacritics, so the transliterated Russian text can be encoded in plain ASCII. This is particularly useful when the transliterated text is to be used in an environment that is sensitive to encoding problems.

$ translit -t "GOST 7.79 RUS" -i russian.txt

Output:

Vse lyudi rozhdayutsya svobodny'mi i ravny'mi v svoem dostoinstve i pravax. Oni nadeleny' razumom i sovest`yu i dolzhny' postupat` v otnoshenii drug druga v duxe bratstva.
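The ASCII-only property is easy to check with a small Python sketch. Again, this is not Lingua::Translit's code: the `GOST_SUBSET` table contains only the letters that occur in the sample text, with mappings read off the output above, and the function name `to_gost` is made up for illustration. Context-dependent rules of the full standard are not modelled.

```python
# Subset of the GOST 7.79 mappings, limited to the letters of the sample text
# (assumption: mappings are read off the sample output above).
GOST_SUBSET = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ж": "zh",
    "з": "z", "и": "i", "л": "l", "м": "m", "н": "n", "о": "o", "п": "p",
    "р": "r", "с": "s", "т": "t", "у": "u", "х": "x", "ш": "sh",
    "ы": "y'", "ь": "`", "ю": "yu", "я": "ya",
}


def to_gost(text):
    out = []
    for ch in text:
        lat = GOST_SUBSET.get(ch.lower())
        if lat is None:
            # Punctuation, spaces and letters outside the subset stay as they are.
            out.append(ch)
        elif ch.isupper():
            # Upper-case letters: capitalize the first Latin character only.
            out.append(lat.capitalize())
        else:
            out.append(lat)
    return "".join(out)


sample = "Все люди рождаются свободными и равными в своем достоинстве и правах."
result = to_gost(sample)
print(result)
print(result.isascii())
```

Unlike ISO 9, several Cyrillic letters map to more than one Latin character here, which is the price for staying within ASCII.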

Lingua::Translit's "GOST 7.79 UKR" table provides support for the Ukrainian mappings of GOST 7.79 as well.

Conclusion

Both users and software developers can easily handle transliteration tasks with the Perl module Lingua::Translit and its command line application translit. The broad range of supported transliteration standards lets you choose the one that fits your needs best. As Lingua::Translit is actively developed and supported, the set of supported standards is extended continuously and (as of version 0.18) already includes 18 transliterations covering the Cyrillic, Greek, Mongolian and Latin writing systems.

These attributes make Lingua::Translit a good fit for private and academic as well as business use.

Posted 2010-07-30 11:58 by Alex Linke
Tags: Lingua::Translit  Perl  language  transliteration  software