LexiconImport

The LexiconImport utility builds Loqate lexicons. Running the utility with no parameters will display the following information:

Usage: lexiconimport <Source file> <Table name> <Unicode table path>

Source file format is tab-delimited UTF8 text containing:
- Search String
- Output String (if different from Search String)
- Word Type
- Confidence

Source File Syntax

This file is in TAB Delimited UTF-8 text format and contains the following fields:

For lexicon rule entry:

Search String
Output String
Type
Confidence

Search string format is as follows:

The desired tokens to search for
Enclose pipe-delimited aliases in parentheses
For example, STA(TIO|SHU)N matches STATION or STASHUN, and (FLAT|FLT|FL) matches FLAT, FLT or FL
Enclose class matches in braces ({ })
Syntax is {CLASSTYPE:MATCHLIMIT}
For example {NUMERIC:5} or {ASCII:1-5}
CLASSTYPE
NUMERIC = ([0-9] in all character sets)
ALPHA = (ASCII + SPACELESS)
ASCII = (LATIN [a-z|A-Z] + Diacritics(ž))
SPACELESS = (CHINESE, JAPANESE, KOREAN, THAI etc.)
NUMERICWORD = MATCHLIMIT refers to word count and not character count. For example, {NUMERICWORD:3-5} matches 168 766 4958 or 80 24 32 98 or 11 22 23 44 55
ALPHAWORD = MATCHLIMIT refers to word count and not character count. For example, {ALPHAWORD:2-4} matches MARTIN LUTHER or MARTIN LUTHER KING or MARTIN LUTHER KING JUNIOR
ASCIIWORD = MATCHLIMIT refers to word count and not character count. For example, {ASCIIWORD:2-3} matches Мартина Luther or Мартина Лютера Кинга in Russian.
SPACELESSWORD = MATCHLIMIT refers to word count. However, each character is a word. For example, {SPACELESSWORD:2-3} matches マー or マーテ in Japanese.
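
The alias syntax above can be illustrated with a short Python sketch. It is not part of LexiconImport; the function name expand_aliases is hypothetical, and the sketch assumes alias groups are never nested.

```python
import itertools
import re


def expand_aliases(pattern: str) -> list[str]:
    """Expand every (A|B|C) group in a search string into its literal variants.

    Assumption: groups are not nested and contain no class matches.
    """
    # re.split with a capture group keeps the group contents in the result:
    # even indices are literal text, odd indices are the alias lists.
    parts = re.split(r"\(([^()]*)\)", pattern)
    choices = [
        part.split("|") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


print(expand_aliases("STA(TIO|SHU)N"))  # ['STATION', 'STASHUN']
print(expand_aliases("(FLAT|FLT|FL)"))  # ['FLAT', 'FLT', 'FL']
```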

Definition of Word:

A set of Numeric or Alpha characters separated by Non-Numeric and Non-Alpha characters.
Separated on Numeric/Alpha boundaries
For example, FLAT1 (FLAT 1) or 94158US (94158 US)
Each SPACELESS character is a word
For example, マーテ (マ ー テ)
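
As a rough illustration of the word definition above, the following Python sketch splits text into words. The name split_words and the Unicode ranges used to approximate SPACELESS characters are assumptions made for this example; the character tables used by the utility itself are not documented here.

```python
import re

# Assumption: approximate SPACELESS with Thai, kana, CJK ideograph and
# Hangul ranges. The real utility relies on its own Unicode tables.
SPACELESS = "\u0e00-\u0e7f\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af"

# A word is one SPACELESS character, a run of digits, or a run of letters.
WORD_RE = re.compile(rf"[{SPACELESS}]|\d+|[^\W\d_{SPACELESS}]+")


def split_words(text: str) -> list[str]:
    """Split text into words on non-alphanumeric and digit/letter boundaries."""
    return WORD_RE.findall(text)


print(split_words("FLAT1"))         # ['FLAT', '1']
print(split_words("94158US"))       # ['94158', 'US']
print(split_words("168 766 4958"))  # ['168', '766', '4958']
print(split_words("マーテ"))          # ['マ', 'ー', 'テ']
```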

MATCHLIMITS

Single number. For example, {NUMERIC:5}
Range. For example, {NUMERIC:2-4}
Multiple numbers. For example, {NUMERIC:1,3,5}
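
The three MATCHLIMIT forms can be sketched in Python as follows. The helper names are hypothetical, only the NUMERIC class is handled, and str.isdecimal is used as an approximation of "0-9 in all character sets". The parser also accepts a mix of ranges and lists such as 1-3,5; whether LexiconImport allows that is not stated above.

```python
import re


def parse_matchlimit(spec: str) -> set[int]:
    """Turn '5', '2-4' or '1,3,5' into the set of allowed counts."""
    allowed: set[int] = set()
    for piece in spec.split(","):
        if "-" in piece:
            low, high = piece.split("-")
            allowed.update(range(int(low), int(high) + 1))
        else:
            allowed.add(int(piece))
    return allowed


def parse_class_token(token: str) -> tuple[str, set[int]]:
    """Split '{CLASSTYPE:MATCHLIMIT}' into its class type and allowed counts."""
    match = re.fullmatch(r"\{([A-Z]+):([0-9,\-]+)\}", token)
    if match is None:
        raise ValueError(f"not a class match: {token!r}")
    return match.group(1), parse_matchlimit(match.group(2))


def matches_numeric(token: str, text: str) -> bool:
    """Check text against a NUMERIC class match (character count)."""
    class_type, allowed = parse_class_token(token)
    if class_type != "NUMERIC":
        raise ValueError("this sketch only handles NUMERIC")
    return text.isdecimal() and len(text) in allowed


print(matches_numeric("{NUMERIC:5}", "94063"))      # True
print(matches_numeric("{NUMERIC:2-4}", "94063"))    # False
print(matches_numeric("{NUMERIC:1,3,5}", "940"))    # True
```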

Punctuation can be specified for a match

Space matches any punctuation. For example, {NUMERIC:5} {NUMERIC:4} matches 94063-1352
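
One way to picture the space rule is to translate a search string into a regular expression. The Python sketch below is only an approximation: search_to_regex is a hypothetical helper, it supports only single-number NUMERIC classes, and it assumes that a space in the search string requires at least one whitespace or punctuation character in the input.

```python
import re


def search_to_regex(search: str) -> re.Pattern:
    """Translate a simple search string into an equivalent Python regex."""
    parts = []
    # Tokens are class matches in braces, single spaces, or literal text.
    for token in re.findall(r"\{[^}]*\}| |[^{ ]+", search):
        if token == " ":
            # Assumption: a space stands for whitespace or any punctuation.
            parts.append(r"[\W_]+")
        elif token.startswith("{"):
            match = re.fullmatch(r"\{NUMERIC:(\d+)\}", token)
            if match is None:
                raise ValueError(f"unsupported class match: {token}")
            parts.append(r"\d{" + match.group(1) + "}")
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts))


zip_plus_four = search_to_regex("{NUMERIC:5} {NUMERIC:4}")
print(bool(zip_plus_four.fullmatch("94063-1352")))  # True
print(bool(zip_plus_four.fullmatch("94063 1352")))  # True
```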

Suffixes. For example, {ASCII:3:15}(BachStraße)

Output String:

BLANK – Leave Input String As Is
CONSTANT – e.g. (AV|AVE|AVENUE|AVENU|AVNUE)<TAB>AV(Alias)
CLASSMATCH
{<WORDNUMBER>[:<START POS>,<LENGTH>]}
For example, {NUMERIC:7}<TAB>{1:0,3}-{1:3,4} will output string “1832715 2347901” as 183-2715
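
The CLASSMATCH output syntax can be sketched in Python as below. The function render_output is hypothetical; based on the example above, it assumes word numbers start at 1 and start positions start at 0.

```python
import re


def render_output(template: str, words: list[str]) -> str:
    """Fill {WORDNUMBER[:START,LENGTH]} placeholders from the matched words."""

    def substitute(match: re.Match) -> str:
        word = words[int(match.group(1)) - 1]  # word numbers are 1-based
        if match.group(2) is None:
            return word  # {1} means the whole word
        start, length = int(match.group(2)), int(match.group(3))
        return word[start:start + length]

    return re.sub(r"\{(\d+)(?::(\d+),(\d+))?\}", substitute, template)


print(render_output("{1:0,3}-{1:3,4}", ["1832715"]))  # 183-2715
print(render_output("{1}", ["94063"]))                # 94063
```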
Type:
Component type of the interpretation
For example, ThoroughfareType, OrganizationType, Locality, PostalCodePrimary etc.
Types are case sensitive

Confidence

Confidence values range from 1 to 255
It is an indicator of likelihood of correct interpretation
For example, {NUMERIC:5}<TAB>{1}<TAB>PostalCodePrimary<TAB>250
A confidence of 0 (zero) is a special case that means Recode: the existing output string is replaced with the new output string. The syntax is <OLD OUTPUT STRING><TAB><NEW OUTPUT STRING><TAB>0
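
To make the field layout concrete, here is a minimal Python sketch that reads one line of a source file. The class and function names are hypothetical, and the validation shown (field counts, confidence range) is an assumption about what a careful reader of the file might check, not a description of what LexiconImport does.

```python
from dataclasses import dataclass


@dataclass
class LexiconRule:
    search: str
    output: str       # an empty string means "leave input string as is"
    word_type: str    # case sensitive, e.g. PostalCodePrimary
    confidence: int   # 1 to 255


@dataclass
class Recode:
    old_output: str
    new_output: str


def parse_line(line: str):
    """Parse one tab-delimited line into a LexiconRule or a Recode entry."""
    fields = line.rstrip("\n").split("\t")
    confidence = int(fields[-1])
    if confidence == 0:
        # Recode entries have three fields: old output, new output, 0.
        if len(fields) != 3:
            raise ValueError("recode entry needs exactly 3 fields")
        return Recode(fields[0], fields[1])
    if len(fields) != 4:
        raise ValueError("lexicon rule needs exactly 4 fields")
    if not 1 <= confidence <= 255:
        raise ValueError("confidence must be between 1 and 255")
    return LexiconRule(fields[0], fields[1], fields[2], confidence)


print(parse_line("{NUMERIC:5}\t{1}\tPostalCodePrimary\t250"))
print(parse_line("AVE\tAV\t0"))
```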

Defining Macros

Macros allow developers to shortcut lexicon rules. Once a macro is defined, it can be used in defining a lexicon rule.

Syntax is TAB delimited and contains the following fields:

Macro Hash Tag
Macro Key String
Expansion String
Type List (comma-separated)

Macro Hash Tag

#macro

indicates the start of a macro definition.

Macro Key String

<key>

enclosed in angle brackets

Expansion String

When a macro is encountered in a lexicon rule entry, the key is automatically replaced by this string

Type List

List of types where the macro is valid

Macro Example

#macro<TAB><ST><TAB>(ST|SAINT|SANKT|SAN)<TAB>Locality,Thoroughfare

#macro<TAB><STREET><TAB>(STREET|ST|STRT)<TAB>Thoroughfare

The lexicon rule:

<ST> MARY <STREET><TAB>ST MARY ST<TAB>Thoroughfare<TAB>200

would get expanded to:

(ST|SAINT|SANKT|SAN) MARY (STREET|ST|STRT)<TAB>ST MARY ST<TAB>Thoroughfare<TAB>200
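
The macro substitution described in this section can be sketched in Python. The helper names are hypothetical. The sketch assumes that keys are replaced only in the search string and that a macro is left unexpanded when the rule's type is not in the macro's type list; neither detail is confirmed by the text above.

```python
def parse_macro(line: str) -> tuple[str, str, set[str]]:
    """Parse '#macro<TAB><key><TAB>expansion<TAB>types' into its parts."""
    tag, key, expansion, types = line.split("\t")
    if tag != "#macro":
        raise ValueError("not a macro definition")
    return key, expansion, set(types.split(","))


def expand_rule(rule_line: str, macros: dict[str, tuple[str, set[str]]]) -> str:
    """Replace macro keys in the search string of a lexicon rule entry."""
    search, output, word_type, confidence = rule_line.split("\t")
    for key, (expansion, valid_types) in macros.items():
        # Assumption: a macro applies only to rules of a listed type.
        if word_type in valid_types:
            search = search.replace(key, expansion)
    return "\t".join([search, output, word_type, confidence])


macros = {}
for definition in (
    "#macro\t<ST>\t(ST|SAINT|SANKT|SAN)\tLocality,Thoroughfare",
    "#macro\t<STREET>\t(STREET|ST|STRT)\tThoroughfare",
):
    key, expansion, types = parse_macro(definition)
    macros[key] = (expansion, types)

rule = "<ST> MARY <STREET>\tST MARY ST\tThoroughfare\t200"
print(expand_rule(rule, macros).replace("\t", "<TAB>"))
# (ST|SAINT|SANKT|SAN) MARY (STREET|ST|STRT)<TAB>ST MARY ST<TAB>Thoroughfare<TAB>200
```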
