Man page of lidc(1)
Index
- NAME
- SYNOPSIS
- DESCRIPTION
- OPTIONS
- DIAGNOSTICS
- EXAMPLES
- RESTRICTIONS
- NOTES
- SEE ALSO
- COPYRIGHT AND LICENSE
NAME
lidc - identifies language and character encoding of textual input
SYNOPSIS
lidc [-i PATH] [-t TYPE] [-f FMT_STR] [-v] [-h]
DESCRIPTION
lidc identifies language and character encoding of textual input.
lidc reads its input from a file or from stdin and handles several input types: plain text, HTML, XML and email (MIME 1.0 and RFC 822).
The results are displayed according to a user-definable format string that allows a broad range of customization.
For a list of supported languages and encodings, see the user manual.
OPTIONS
- -i PATH
Set the input file; "-" denotes stdin (default: stdin).
- -t TYPE [txt, html, xml, email]
Set the input file's type. The following types are supported:
    txt (default)
        Plain text document (without markup).
    html
        Any HTML document (XHTML, HTML 4, ...).
    xml
        Any XML document.
    email
        Any email, either conforming to RFC 822 or MIME (as specified by RFC 2045-2049, 2387, 1847 or 3462). See RESTRICTIONS below.
If no TYPE is set and lidc is reading from a file (-i), lidc tries to determine the type automatically from the file's extension. The commonly used extensions (.txt, .html, .htm, .xml, .eml) are supported, as are all Maildir extensions and keywords used by the Dovecot IMAP server.
If no type can be determined and no type is set, "txt" is assumed as default.
- -f FMT_STR
Set the output format string. You may customize the output format string as needed. The following flags are provided and replaced with the associated results in the output:
    %l -> identified language
        %l expands to the English name of the identified language, e.g. "German", "French" or "Swedish".
    %i -> ISO 639-3 language code
        %i expands to the ISO 639-3 code of the identified language, e.g. "deu", "fra" or "swe".
    %e -> identified encoding
        %e expands to the identified encoding, e.g. "UTF-8", "ISO-8859-1", "UTF-32LE" or "Windows-1252".
    %d -> declared document encoding
        %d expands to the declared document encoding in lowercase letters, e.g. "utf-8", "iso-8859-1", "utf-32le" or "windows-1252". If no document encoding could be determined or the document type does not support encoding declarations (txt), %d expands to "none".
    %f -> input file's name
        %f expands to the input file's name or to stdin.
Besides the above flags, the common escape sequences \n (newline), \r (carriage return), \t (tab) and \a (bell) are supported.
If no format string is set, the default output is: "%l, %i, %e\n"
The output is sent to stdout.
- -v
Show version information.
- -h
Show a short help text.
DIAGNOSTICS
If an error occurs, lidc prints an error message to stderr and terminates with exit code 1.
Warnings, if any, are also printed to stderr.
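A calling script might act on these diagnostics along the following lines. This is only a sketch: identify_all is a hypothetical helper name, and it assumes that lidc exits with status 0 on success, which this page does not state explicitly. Discarding stderr also hides warnings, so a real script may prefer to redirect it to a log file instead.

```shell
# Hypothetical helper: run lidc on every file given as an argument.
# Results of successful runs go to stdout; failures are reported on stderr.
# Assumption: lidc exits with 0 on success (only exit code 1 is documented).
identify_all() {
    failed=0
    for f in "$@"; do
        if result=$(lidc -i "$f" 2>/dev/null); then
            printf '%s: %s\n' "$f" "$result"
        else
            printf '%s: lidc failed\n' "$f" >&2
            failed=$((failed + 1))
        fi
    done
    # Return non-zero if at least one file could not be processed.
    [ "$failed" -eq 0 ]
}

# Intended use (requires lidc on PATH):
#   identify_all danish.txt german.eml
```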
EXAMPLES
Using lidc to identify language and encoding of a plain text file. The default output format string is used:
$ lidc -i danish.txt
Danish, dan, UTF-32BE
Using lidc to identify language and encoding of an email. The correct
type, email, is automatically determined by evaluating the file's
extension.
$ lidc -i german.eml
German, deu, ISO-8859-1
Same as above, but utilizing a pipe. The type has to be set in order to
prevent lidc from using the default type, txt.
$ cat german.eml | lidc -t email
German, deu, ISO-8859-1
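When the original file name is known but the data arrives through a pipe, a script could choose the TYPE itself. The following is a rough sketch under stated assumptions: lidc_type_for is a hypothetical helper, it covers only the common extensions listed under -t (no Maildir names), and it matches extensions case-sensitively, which may differ from what lidc does internally.

```shell
# Hypothetical helper: map a file name to a value for lidc's -t option,
# using only the common extensions listed in the -t description.
lidc_type_for() {
    case "$1" in
        *.html|*.htm) echo html ;;
        *.xml)        echo xml ;;
        *.eml)        echo email ;;
        *)            echo txt ;;   # documented default type
    esac
}

# Intended use (requires lidc on PATH):
#   cat german.eml | lidc -t "$(lidc_type_for german.eml)"
```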
Processing a UTF-32 encoded XML file and setting a custom format string (including the declared document encoding):
$ lidc -i hungarian.xml -f "%f: %l, %e, %d\n"
hungarian.xml: Hungarian, UTF-32LE, utf-32le
A more complex example, providing basic XML output:
$ lidc -i german.eml -f \
"<email>\n\t<lang>%l</lang>\n\t<enc>%e</enc>\n</email>\n"
<email>
    <lang>German</lang>
    <enc>ISO-8859-1</enc>
</email>
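For further processing in a script, a tab-separated format string may be easier to parse than the default output. The sketch below is illustrative only: LIDC_FMT and parse_lidc_line are hypothetical names, and the file name is placed last so that any tabs it contains end up in the final field.

```shell
# Format string assumed for this sketch: tab-separated fields, file name last.
# The single quotes pass \t and \n to lidc unchanged.
LIDC_FMT='%l\t%i\t%e\t%d\t%f\n'

# Hypothetical helper: split one line produced with LIDC_FMT into its
# fields and print them as key=value pairs.
parse_lidc_line() {
    tab=$(printf '\t')
    printf '%s\n' "$1" | {
        IFS=$tab read -r lang iso enc declared file
        printf 'lang=%s iso=%s enc=%s declared=%s file=%s\n' \
            "$lang" "$iso" "$enc" "$declared" "$file"
    }
}

# Intended use (requires lidc on PATH):
#   parse_lidc_line "$(lidc -i hungarian.xml -f "$LIDC_FMT")"
```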
RESTRICTIONS
- There is no support for UTF-16 or UTF-32 encodings in emails.
- Concerning MIME emails, only the following media types are supported:
  - text/plain
  - text/html
  - message/rfc822
  - multipart/mixed
  - multipart/alternative
  - multipart/digest
  - multipart/parallel
  - multipart/related
  - multipart/signed
  - multipart/report
NOTES
The declared and the identified encoding may differ. This is not necessarily an error, but it may hint at a problem. Two examples:
1. If the declared encoding is ISO-8859-1 and the identified encoding is ASCII, the declaration is usually correct: all characters actually used may lie in the ASCII range, and ISO-8859-1 is a superset of ASCII.
2. If the declared encoding is UTF-8 and the identified encoding is ISO-8859-1, this may hint at a problem. For example, if an HTML document declares UTF-8 but is actually encoded differently, the page may be displayed with "broken" characters.
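The two cases above could be told apart in a script roughly as follows. This is a sketch under assumptions, not part of lidc: compare_encodings is a hypothetical helper, the spelling "ASCII" (or "US-ASCII") for the identified encoding is assumed, and the list of ASCII-compatible encodings is deliberately short. A reported mismatch is only a hint, as described above.

```shell
# Hypothetical helper: classify a pair of encodings as reported by lidc.
# $1 = identified encoding (%e), $2 = declared encoding (%d, lowercase).
compare_encodings() {
    identified=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    declared=$2
    if [ "$declared" = "none" ]; then
        echo undeclared
    elif [ "$identified" = "$declared" ]; then
        echo match
    else
        case "$identified" in
            ascii|us-ascii)
                # Assumption: these declared encodings are ASCII supersets.
                case "$declared" in
                    utf-8|iso-8859-*|windows-125*) echo compatible ;;
                    *) echo mismatch ;;
                esac
                ;;
            *) echo mismatch ;;
        esac
    fi
}

# Example calls:
#   compare_encodings ASCII iso-8859-1     (case 1 above)
#   compare_encodings ISO-8859-1 utf-8     (case 2 above)
```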
SEE ALSO
User Manual (English version), Benutzerhandbuch (German version)
COPYRIGHT AND LICENSE
Copyright (c) 2009-2011 Lingua-Systems Software GmbH


