Using Lingua::Lid in a Threaded Application

As of version 0.02, Lingua::Lid is thread-safe, provided it is compiled against a recent version of lid (3.0.0 or higher).

This allows you to call Lingua::Lid's language and charset identification functions, such as lid_ffile and lid_fstr, concurrently from multiple threads created with Perl's threads module. Because thread support in Perl is a compile-time option, you will need a thread-enabled build of Perl, such as those shipped by most modern Linux distributions (for example Debian Lenny or Ubuntu Lucid) or ActiveState's distribution for Windows.
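As a rough illustration of calling an identification function from several threads at once, here is a minimal sketch. The helper identify_strings and the stand-in identifier used in the demo are hypothetical and not part of Lingua::Lid; they only exist so the sketch runs without lid installed.

#!/usr/bin/perl
use strict;
use warnings;
use threads;

## Run $identify->($text) for every text, one thread per text,
## and return the results in input order.
## (identify_strings is a hypothetical helper, not a Lingua::Lid
## function.)
sub identify_strings
{
    my ($identify, @texts) = @_;

    my @workers = map {
        my $text = $_;
        threads->create({ context => "scalar" }, sub {
            return $identify->($text);
        });
    } @texts;

    ## join() blocks until the respective thread has finished
    return map { scalar $_->join() } @workers;
}

## Demo with a stand-in identifier that just returns the length
## of its input.
my @lengths = identify_strings(sub { length $_[0] }, "hej", "hello");
print "@lengths\n";    ## prints "3 5"

With Lingua::Lid loaded, the first argument could be sub { Lingua::Lid::lid_fstr($_[0]) }, assuming lid_fstr takes the string to identify as its only argument. Any error message should then be fetched with Lingua::Lid::errstr() inside the worker thread, as the example application does.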

Of course, you are free to use Lingua::Lid in any non-threaded code as well. Software written using Lingua::Lid v0.01 will stay functional without modification.

If you are not familiar with using Perl's threads, the perlthrtut tutorial and the threads module documentation are a good place to start.

The following example application, lingua-lid-thread-example.pl, shows how Lingua::Lid may be used in a threaded application. A set of files is passed to the application as arguments. The application then runs up to $max_threads threads at a time, each identifying the language and character encoding of one file, and prints the results until there are no files left to identify.

#!/usr/bin/perl -w

use strict;
use Config;

die "usage: $0 file(s)\n" unless scalar @ARGV;

## check whether the used version of Perl has been compiled
## with thread support
unless ($Config{useithreads})
{
    die "This version of Perl does not support threads!\n";
}

require threads;
require Lingua::Lid;

my $max_threads = 2;
my $nr = 0;

## while there are files given as arguments left...
while (@ARGV)
{
    my $file = shift(@ARGV);

    ## ...create a thread that identifies the file's language
    ## and charset and returns the determined results in
    ## scalar context when it is requested to join() to the
    ## main thread of control again.
    threads->create({ context => "scalar" }, sub {

        ## identify language and charset of the file using
        ## lid's lid_ffile
        my $res = Lingua::Lid::lid_ffile($file);

        return {
            file   => $file,

            ## $res will be undef if no result could be
            ## computed
            result => $res,

            ## in this case, Lingua::Lid::errstr() will
            ## return the error message reported by lid's
            ## lid_strerror() function.
            errstr => Lingua::Lid::errstr()
        };
    });

    ## if the maximum number of concurrent threads has been
    ## reached or no files are left to identify, join all
    ## threads and print their results.
    if (threads->list() >= $max_threads || ! scalar @ARGV)
    {
        foreach my $thread (threads->list())
        {
            my $rv = $thread->join();

            printf("%02d: %s: %s\n",
                ++$nr,
                $rv->{file},
                $rv->{result} ?
                    join(", ", $rv->{result}->{language},
                               $rv->{result}->{isocode},
                               $rv->{result}->{encoding})
                    : "ERROR: $rv->{errstr}"
            );
        }
    }
}

Download the source code: Lingua-Lid-thread-example.pl

Note that the package variable $Lingua::Lid::errstr could have been used instead of the function Lingua::Lid::errstr(). Internally, the variable is tied to Lingua::Lid::errstr() using Tie::Scalar and is therefore thread-safe as well. However, new code should use the function to obtain error messages: the package variable conceptually implies a lack of thread-safety and may be removed in a future release of Lingua::Lid.
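To illustrate how a package variable can be tied to a function, here is a minimal, self-contained sketch. It is not Lingua::Lid's actual implementation: the packages Demo::Lib and Demo::Lib::Errstr and the function fail() are hypothetical stand-ins, and only the general Tie::Scalar mechanism is shown.

#!/usr/bin/perl
use strict;
use warnings;

## Hypothetical stand-in for a module with a function-based
## error accessor.
package Demo::Lib;

## Not shared between threads: under ithreads every thread
## works on its own copy of this variable.
my $last_error = "";

sub errstr { return $last_error }
sub fail   { $last_error = shift; return }

## Tie class whose FETCH delegates to the accessor function.
package Demo::Lib::Errstr;

require Tie::Scalar;
our @ISA = ("Tie::StdScalar");

sub FETCH { return Demo::Lib::errstr() }

package main;

## From now on, reading the variable calls Demo::Lib::errstr().
tie $Demo::Lib::errstr, "Demo::Lib::Errstr";

Demo::Lib::fail("Failed to open file");

print "function: ", Demo::Lib::errstr(), "\n";
print "variable: $Demo::Lib::errstr\n";
## both lines print "Failed to open file"

Because every read of the tied variable goes through the function, the variable never holds a value of its own and always reflects whatever the function returns in the current thread.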

Here is an example invocation using a set of text files in a variety of languages and charsets, intermixed with some nonexistent or "special" files to demonstrate Lingua::Lid's error handling.

$ perl lingua-lid-thread-example.pl danish.txt \
   dutch.txt non-existent.txt english.txt /dev/null \
   french.txt german.txt /dev/zero swedish.txt
01: danish.txt: Danish, dan, UTF-8
02: dutch.txt: Dutch, nld, UTF-8
03: non-existent.txt: ERROR: Failed to open file
04: english.txt: English, eng, UTF-8
05: /dev/null: ERROR: Insufficient input length
06: french.txt: French, fra, UTF-8
07: german.txt: German, deu, UTF-8
08: /dev/zero: ERROR: Binary input data
09: swedish.txt: Swedish, swe, UTF-8

Posted 2010-06-21 09:12 by Alex Linke