Man page of tw-learn(1)


NAME

tw-learn - learns/unlearns documents of categories

SYNOPSIS

tw-learn [db-opt] [opt] -c CATEGORY file(s)

tw-learn [db-opt] [opt] -U -c CATEGORY file(s)

DESCRIPTION

tw-learn analyses input documents, extracts relevant information and updates a category's profile accordingly, so that similar documents can automatically be assigned to the most appropriate category afterwards.

If a document has been learned as an example of a category by mistake, tw-learn is able to "unlearn" this document from the category's profile as well. The database will automatically be optimized afterwards.

OPTIONS

DATABASE OPTIONS

These options are required in order to establish a connection to the Textweiser database. They can be given on the command line, supplied in a configuration file (-f / --config), or both.

NOTE: If Textweiser uses an SQLite database backend, only the -d / --db_name option is required and all other database options are not available.

-d / --db_name database name

Name of the Textweiser database (UTF-8 encoded).

If Textweiser uses an SQLite database backend, database name is the path to the database file, which need not be UTF-8 encoded.

-s / --host hostname

Hostname of the database server.

-u / --user username

Username to connect to the database.

-w / --passwd password

Password to connect to the database.

NOTE: If no password is given as an argument on the command line and no password is set in the configuration file, you will be prompted to enter the password. The password will not be echoed during input.

-p / --port port

Port of the database on hostname.

If port is not set, the default port of the database is assumed.

-t / --instance instance

Name of the Microsoft SQL Server instance on hostname.

NOTE: This option is only available if Textweiser uses the Microsoft SQL Server database backend.

-e / --encrypt

Request that communication with the database be encrypted. If the database driver cannot establish an encrypted connection, Textweiser will abort.

--trust-cert

Request to trust any certificate presented by the database server, without validation.

NOTE: In order to use self-signed certificates, this option has to be enabled.

NOTE: Passing this option implicitly enables communication encryption.

The database configuration may be given in a configuration file as well. For details, see CONFIGURATION FILE SYNTAX below.

-f / --config path

Path to a Textweiser database configuration file.

COMMON OPTIONS

-v / --verbose

Enable verbose output.

-V / --version

Show version information and terminate.

-h / --help

Show a short help screen and terminate.

LEARNING OPTIONS

Every invocation of tw-learn works on a single document or a set of documents and updates the profile of a category accordingly.

-c / --cat category name

Name of the category the given documents are examples of (UTF-8 encoded).

MODE OPTIONS

If no mode option is given, tw-learn learns the given document or set of documents as examples of a category.

-U / --unlearn

Switches to unlearning mode. The given documents are treated as not being examples of the given category (-c / --cat) and are removed from its profile.
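As an illustration of combining both modes, the following sketch moves documents that were learned into the wrong category: it unlearns them there and then learns them as examples of the intended category. The function name move_documents, the TW_LEARN override variable and the "Support" category used in the usage note are assumptions made for this example, not part of Textweiser.

```shell
# TW_LEARN is a hypothetical override hook for this sketch (useful for dry runs);
# it defaults to the real tool.
TW_LEARN="${TW_LEARN:-tw-learn}"

move_documents() {
    # Usage: move_documents DB WRONG_CATEGORY RIGHT_CATEGORY FILE...
    db=$1; wrong=$2; right=$3
    shift 3
    # Unlearn from the wrong category first; learn into the right one
    # only if unlearning succeeded.
    "$TW_LEARN" -d "$db" -U -c "$wrong" "$@" &&
    "$TW_LEARN" -d "$db" -c "$right" "$@"
}
```

A call might look like: move_documents example.sqlt Sales Support mail_17.txt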

DIAGNOSTICS

If an error occurs, the application terminates with an appropriate error code dependent on the operating system in use and prints an error message to stderr.
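In a script, the exit status can be checked in the usual shell way. The sketch below assumes that tw-learn follows the common convention of returning zero on success and a non-zero value on failure; the exact codes are system dependent, as stated above. The function name learn_or_report and the TW_LEARN override variable are hypothetical.

```shell
# TW_LEARN is a hypothetical override hook for this sketch; it defaults to the real tool.
TW_LEARN="${TW_LEARN:-tw-learn}"

learn_or_report() {
    # Usage: learn_or_report DB CATEGORY FILE...
    db=$1; category=$2
    shift 2
    if "$TW_LEARN" -d "$db" -c "$category" "$@"; then
        return 0
    else
        # In the else branch, $? still holds the exit status of the command above.
        status=$?
        echo "learning \"$category\" failed with exit status $status" >&2
        return "$status"
    fi
}
```

The error message of tw-learn itself still goes to stderr; the wrapper only adds the numeric status and passes it on to the caller.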

CONFIGURATION FILE SYNTAX

The syntax of Textweiser configuration files follows an easy-to-use key/value scheme. Empty lines and any leading or trailing whitespace are ignored. Lines starting with the character # are considered comments.

Values may be enclosed within matching single or double quotes and are assigned to keys using the = character.

SUPPORTED KEYS

host

Hostname of the database server.

user

Username for database authentication.

passwd

Password for database authentication.

db_name

Name of the Textweiser database.

port

Port number of the database server.

instance

Name of the Microsoft SQL Server instance.

encrypt

Enable/disable communication encryption.

The following values are recognized:

"yes" or "on"

Enable encryption.

"no" or "off"

Disable encryption.

In order to trust a server's certificate, append the "trust-cert" token to the value, separated by a comma and/or whitespace, e.g.

 encrypt = "yes, trust-cert"
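Putting the keys together, a complete configuration file could look like the one written by the following sketch. The file name textweiser.conf and all values are placeholders chosen for this example. The port and passwd keys are left out on purpose, so that the default port is assumed and the password is prompted for, as described under DATABASE OPTIONS.

```shell
# Write an example configuration file; every value below is a placeholder.
cat > textweiser.conf <<'EOF'
# Textweiser database configuration (example values only)
host = "db.example.com"
user = 'textweiser'
db_name = "textweiser"

# No "port" key: the default port of the database is assumed.
# No "passwd" key: the password is prompted for instead.
# "trust-cert" accepts any server certificate without validation.
encrypt = "yes, trust-cert"
EOF

# Restrict access to the owner in case a passwd key is added later.
chmod 600 textweiser.conf
```

With this file in place, a call such as tw-learn -f textweiser.conf -c Sales sales_1.txt would take its connection settings from it.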

NOTES

o All category names and input documents have to be encoded in UTF-8.

o On Microsoft Windows, an option may also be introduced by the "/" character.

o On any Unix-like system, the common sequence "--" terminates parsing of options.

o It is recommended to train each category by learning from at least ten appropriate documents.

o Any configuration file specified (-f / --config) is parsed and evaluated before other command-line arguments are evaluated. As a result, arguments given on the command line override settings given in a configuration file.

o Communication encryption is the task of the database driver. Textweiser merely instructs the driver to enable or disable encryption according to the passed options and checks whether the operation succeeded.

EXAMPLES

For brevity, the following examples assume Textweiser is using the SQLite database backend and that a Textweiser database and a set of categories have already been created. See the EXAMPLES section of tw-admin(1) for details.

Learn a set of files that are examples of a "Sales" category:

 $ tw-learn -v -d example.sqlt -c Sales sales_1.txt sales_2.txt
 # Processing sales_1.txt... OK
 # Processing sales_2.txt... OK
 Learned 2 documents of category "Sales"

Unlearn a previously learned document from the "Sales" category:

 $ tw-learn -v -d example.sqlt -c Sales -U sales_2.txt
 # Processing sales_2.txt... OK
 Optimizing database
 Unlearned 1 document of category "Sales"
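Learn all documents of a directory in one invocation and warn if fewer than the ten documents recommended in NOTES are available. This is only a sketch: the directory layout (one directory of .txt files per category), the function name learn_category and the TW_LEARN override variable are assumptions made for this example.

```shell
# TW_LEARN is a hypothetical override hook for this sketch; it defaults to the real tool.
TW_LEARN="${TW_LEARN:-tw-learn}"
MIN_DOCS=10   # recommended minimum number of documents per category (see NOTES)

learn_category() {
    # Usage: learn_category DB CATEGORY DIRECTORY
    db=$1; category=$2; dir=$3
    # Replace the function's positional parameters with the matching files.
    set -- "$dir"/*.txt
    # If the glob matched nothing, "$1" is the literal pattern; clear the list.
    [ -e "$1" ] || set --
    if [ "$#" -lt "$MIN_DOCS" ]; then
        echo "warning: only $# document(s) for \"$category\", at least $MIN_DOCS recommended" >&2
    fi
    # Nothing to learn: do not call tw-learn without input files.
    [ "$#" -gt 0 ] || return 0
    "$TW_LEARN" -v -d "$db" -c "$category" "$@"
}
```

A call might look like: learn_category example.sqlt Sales ./training/sales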

SEE ALSO

tw-admin(1), tw-classify(1)

Textweiser User Manual

http://www.lingua-systems.com/text-classifier/textweiser-library/