Man page of tw-learn(1)
NAME
tw-learn - learn or unlearn documents as examples of a category
SYNOPSIS
tw-learn [db-opt] [opt] -c CATEGORY file(s)
tw-learn [db-opt] [opt] -U -c CATEGORY file(s)
DESCRIPTION
tw-learn analyses input documents, extracts relevant information and updates a category's profile accordingly, so that similar documents can automatically be assigned to the most appropriate category afterwards.
If a document has been learned as an example of a category by mistake, tw-learn is able to "unlearn" this document from the category's profile as well. The database will automatically be optimized afterwards.
OPTIONS
DATABASE OPTIONS
These options are required in order to establish a connection to the Textweiser database. They can be given on the command line, supplied in a configuration file (-f / --config), or both.
NOTE: If Textweiser uses an SQLite database backend, only the -d / --db_name option is required and all other database options are not available.
- -d / --db_name database name
  Name of the Textweiser database (UTF-8 encoded).
  If Textweiser uses an SQLite database backend, database name is the path to the database file, which need not be UTF-8 encoded.
- -s / --host hostname
  Hostname of the database server.
- -u / --user username
  Username to connect to the database.
- -w / --passwd password
  Password to connect to the database.
  NOTE: If no password is given as an argument on the command line and no password is set in the configuration file, you will be prompted to enter the password. The password will not be echoed during input.
- -p / --port port
  Port of the database on hostname.
  If port is not set, the default port of the database is assumed.
- -t / --instance instance
  Name of the Microsoft SQL Server instance on hostname.
  NOTE: This option is only available if Textweiser uses the Microsoft SQL Server database backend.
- -e / --encrypt
  Request encrypted communication with the database. If the database driver cannot establish an encrypted connection, Textweiser will abort.
- --trust-cert
  Request to trust any certificate presented by the database server, without validation.
  NOTE: In order to use self-signed certificates, this option has to be enabled.
  NOTE: Passing this option implicitly enables communication encryption.
The database configuration may be given in a configuration file as well. For details, see CONFIGURATION FILE SYNTAX below.
- -f / --config path
  Path to a Textweiser database configuration file.
COMMON OPTIONS
- -v / --verbose
  Enable verbose output.
- -V / --version
  Show version information and terminate.
- -h / --help
  Show a short help screen and terminate.
LEARNING OPTIONS
Every invocation of tw-learn works on a single document or a set of documents and updates the profile of a category accordingly.
- -c / --cat category name
  Name of the category the given documents are examples of (UTF-8 encoded).
MODE OPTIONS
If no mode option is given, tw-learn learns the given document or set of documents as examples of a category.
- -U / --unlearn
  Switches to unlearning mode. The given documents are treated as not being examples of the given category (-c / --cat).
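A document that was learned under the wrong category can be corrected by unlearning it there and then learning it under the intended category. The following is a minimal sketch for a POSIX shell, assuming the SQLite backend and that tw-learn exits with status 0 on success. The function name relearn, the TW_LEARN variable and all file and category names are illustrative and not part of Textweiser.

```shell
# Command to run; overridable so the sketch can be dry-run (illustrative variable).
TW_LEARN="${TW_LEARN:-tw-learn}"

# relearn DB WRONG_CATEGORY RIGHT_CATEGORY FILE...
# Unlearns the files from WRONG_CATEGORY, then learns them as examples of
# RIGHT_CATEGORY. The second step only runs if unlearning succeeded
# (assumption: tw-learn returns exit status 0 on success).
relearn() (
    db=$1 wrong=$2 right=$3
    shift 3
    "$TW_LEARN" -d "$db" -U -c "$wrong" "$@" &&
    "$TW_LEARN" -d "$db" -c "$right" "$@"
)

# Usage (hypothetical names):
#   relearn example.sqlt Sales Support ticket_17.txt
```

Running the body in a subshell keeps the helper variables out of the calling shell.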
DIAGNOSTICS
If an error occurs, the application prints an error message to stderr and
terminates with an error code that depends on the operating system in use.
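In scripts, the exit status can be checked after each call. The sketch below assumes the usual convention on Unix-like systems that status 0 means success and any other value means failure; the concrete error codes are not specified here. The function name learn_checked and the TW_LEARN variable are illustrative.

```shell
# Command to run; overridable so the sketch can be dry-run (illustrative variable).
TW_LEARN="${TW_LEARN:-tw-learn}"

# learn_checked ARGS...
# Runs tw-learn with ARGS. The error message of tw-learn itself is passed
# through on stderr unchanged. On failure, the exit status is reported and
# returned to the caller (assumption: non-zero status means failure).
learn_checked() {
    "$TW_LEARN" "$@" || {
        status=$?
        echo "tw-learn failed with exit status $status" >&2
        return "$status"
    }
}

# Usage (hypothetical names):
#   learn_checked -d example.sqlt -c Sales sales_1.txt || exit 1
```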
CONFIGURATION FILE SYNTAX
The syntax of Textweiser configuration files follows a simple
key/value scheme.
Empty lines and any leading/trailing whitespace are ignored.
Lines starting with the character # are considered comments.
Values may be enclosed within matching single or double quotes and are
assigned to keys using the = character.
SUPPORTED KEYS
- host
  Hostname of the database server.
- user
  Username for database authentication.
- passwd
  Password for database authentication.
- db_name
  Name of the Textweiser database.
- port
  Port number of the database server.
- instance
  Name of the Microsoft SQL Server instance.
- encrypt
  Enable/disable communication encryption.
  The following values are recognized:
  - "yes" or "on"
    Enable encryption.
  - "no" or "off"
    Disable encryption.
In order to trust a server's certificate, append the "trust-cert" token to the value, separated by a comma and/or whitespace, for example:
encrypt = "yes, trust-cert"
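As an illustration of the syntax described above, the following sketch writes a small configuration file for a server-based backend from the shell. The file name textweiser.conf and all values are placeholders; port and passwd are deliberately left out.

```shell
# Write a sample configuration file (file name and values are placeholders).
cat > textweiser.conf <<'EOF'
# Textweiser database configuration (sample values)
host = "db.example.com"
user = "textweiser"
db_name = "textweiser"

# No port is set, so the default port of the database is assumed.
encrypt = "yes"
EOF

# Restrict access in case a passwd entry is added later.
chmod 600 textweiser.conf
```

The file could then be passed with -f, for instance as tw-learn -f textweiser.conf -c Sales sales_1.txt. Because no passwd key is set, tw-learn would prompt for the password.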
NOTES
- All category names and input documents have to be encoded in UTF-8.
- On Microsoft Windows, an option may also be introduced by the "/" character.
- On any Unix-like system the common sequence "--" terminates parsing of options.
- It is recommended to train each category by learning from at least ten appropriate documents.
- Any configuration file specified (-f / --config) is parsed and evaluated before other command line arguments are evaluated. As a result, arguments given on the command line override settings given in a configuration file.
- Communication encryption is the task of the database driver. Textweiser merely instructs the driver to enable or disable encryption according to the passed options and checks whether the operation succeeded.
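Two of the notes above, the "--" terminator and the recommendation of at least ten documents per category, can be combined in a small wrapper. This is a sketch for a POSIX shell on a Unix-like system, assuming the SQLite backend and training documents stored as *.txt files in one directory per category. The names learn_dir, TW_LEARN and MIN_DOCS are illustrative and not part of Textweiser.

```shell
# Command to run; overridable so the sketch can be dry-run (illustrative variable).
TW_LEARN="${TW_LEARN:-tw-learn}"
MIN_DOCS=10   # recommended minimum number of documents per category

# learn_dir DB CATEGORY DIR
# Learns every *.txt file in DIR as an example of CATEGORY and prints a
# warning when fewer than MIN_DOCS files are available.
learn_dir() (
    db=$1 category=$2 dir=$3
    set -- "$dir"/*.txt
    # An unmatched glob stays literal, so check that the first match exists.
    [ -e "$1" ] || { echo "no .txt files in $dir" >&2; exit 1; }
    if [ "$#" -lt "$MIN_DOCS" ]; then
        echo "warning: only $# documents for \"$category\", $MIN_DOCS recommended" >&2
    fi
    # "--" ends option parsing, so a path starting with "-" is not
    # mistaken for an option.
    "$TW_LEARN" -d "$db" -c "$category" -- "$@"
)

# Usage (hypothetical names):
#   learn_dir example.sqlt Sales training/Sales
```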
EXAMPLES
For brevity, the following examples assume Textweiser is using the SQLite database backend and that a Textweiser database and a set of categories have already been created. See the EXAMPLES section of tw-admin(1) for details.
Learn a set of files that are examples of a "Sales" category:
Shell $ tw-learn -v -d example.sqlt -c Sales sales_1.txt sales_2.txt
# Processing sales_1.txt... OK
# Processing sales_2.txt... OK
Learned 2 documents of category "Sales"
Unlearn a previously learned document from the "Sales" category:
Shell $ tw-learn -v -d example.sqlt -c Sales -U sales_2.txt
# Processing sales_2.txt... OK
Optimizing database
Unlearned 1 document of category "Sales"
SEE ALSO
tw-admin(1), tw-classify(1)
Textweiser User Manual
http://www.lingua-systems.com/text-classifier/textweiser-library/
COPYRIGHT
Copyright (c) 2010-2011 Lingua-Systems Software GmbH


