LexiconImport

The LexiconImport utility builds Loqate lexicons. Running the utility with no parameters will display the following information:

Usage: lexiconimport <Source file> <Table name> <Unicode table path>

Source file format is tab-delimited UTF8 text containing:
- Search String
- Output String (if different from Search String)
- Word Type
- Confidence

Source File Syntax

The source file is tab-delimited UTF-8 text. Each lexicon rule entry contains the following fields:
  • Search String
  • Output String
  • Type
  • Confidence
Search string format is as follows:
  • The desired tokens to search for
  • Enclose pipe-delimited aliases in parentheses
      • For example, STA(TIO|SHU)N = STATION or STASHUN, and (FLAT|FLT|FL) = FLAT or FLT or FL
  • Enclose a class match in braces ({ })
      • Syntax is {CLASSTYPE:MATCHLIMIT}
      • For example, {NUMERIC:5} or {ASCII:1-5}
  • CLASSTYPE
      • NUMERIC = ([0-9] in all character sets)
      • ALPHA = (ASCII + SPACELESS)
      • ASCII = (LATIN [a-z|A-Z] + Diacritics(ž))
      • SPACELESS = (CHINESE, JAPANESE, KOREAN, THAI etc.)
      • NUMERICWORD = MATCHLIMIT refers to word count and not character count. For example, {NUMERICWORD:3-5} matches 168 766 4958 or 80 24 32 98 or 11 22 23 44 55
      • ALPHAWORD = MATCHLIMIT refers to word count and not character count. For example, {ALPHAWORD:2-4} matches MARTIN LUTHER or MARTIN LUTHER KING or MARTIN LUTHER KING JUNIOR
      • ASCIIWORD = MATCHLIMIT refers to word count and not character count. For example, {ASCIIWORD:2-3} matches Мартина Luther or Мартина Лютера Кинга in Russian.
      • SPACELESSWORD = MATCHLIMIT refers to word count. However, each character is a word. For example, {SPACELESSWORD:2-3} matches マー or マーテ in Japanese.
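The alias syntax above can be illustrated with a small Python sketch that lists every literal string a search string with parenthesised aliases stands for. This is not the utility's own code. The function name expand_aliases is hypothetical, and the sketch assumes alias groups are not nested.

```python
import itertools
import re


def expand_aliases(pattern):
    """Return every literal variant of a search string with (A|B|C) groups."""
    # Split into literal text and parenthesised groups; the capturing
    # group keeps each "(...)" piece in the result list.
    parts = re.split(r"(\([^()]*\))", pattern)
    choices = []
    for part in parts:
        if part.startswith("(") and part.endswith(")"):
            # An alias group contributes one option per pipe-delimited alias.
            choices.append(part[1:-1].split("|"))
        else:
            # Literal text contributes exactly one option.
            choices.append([part])
    # The Cartesian product combines one option from each piece.
    return ["".join(combo) for combo in itertools.product(*choices)]


print(expand_aliases("STA(TIO|SHU)N"))
print(expand_aliases("(FLAT|FLT|FL)"))
```

The first call lists STATION and STASHUN, and the second lists FLAT, FLT and FL, matching the examples above.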
Definition of Word:
  • A set of Numeric or Alpha characters separated by Non-Numeric and Non-Alpha characters
  • Separated on Numeric/Alpha boundaries
      • For example, FLAT1 (FLAT 1) or 94158US (94158 US)
  • Each SPACELESS character is a word
      • For example, マーテ (マ ー テ)
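The word definition above can be approximated in Python with a single regular expression. This is only a sketch: the function name split_words is hypothetical, and the SPACELESS character ranges (Thai, Hiragana/Katakana, CJK ideographs, Hangul) are a rough assumption rather than the set the utility actually uses.

```python
import re

# Assumption: an approximate set of "spaceless" script ranges
# (Thai, Hiragana/Katakana, CJK ideographs, Hangul syllables).
SPACELESS = "\u0E00-\u0E7F\u3040-\u30FF\u4E00-\u9FFF\uAC00-\uD7AF"

WORD_RE = re.compile(
    r"\d+"                        # a run of digits is one word
    + f"|[{SPACELESS}]"           # each spaceless character is its own word
    + rf"|[^\W\d_{SPACELESS}]+"   # a run of other letters is one word
)


def split_words(text):
    """Split text into words; punctuation and spaces act as separators."""
    return WORD_RE.findall(text)


print(split_words("FLAT1"))
print(split_words("94158US"))
print(split_words("マーテ"))
```

Because digits and letters are matched by separate alternatives, FLAT1 splits on the Alpha/Numeric boundary even though it contains no separator character.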
MATCHLIMITS
  • Single number. For example, {NUMERIC:5}
  • Range. For example, {NUMERIC:2-4}
  • Multiple numbers. For example, {NUMERIC:1,3,5}
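To make the three MATCHLIMIT forms concrete, the following Python sketch turns a limit into the set of counts it allows and splits a class match into its two parts. The names parse_matchlimit and parse_class_match are hypothetical. The sketch would also accept a mix of ranges and lists such as 1,3-5, which the format description does not confirm is valid.

```python
def parse_matchlimit(spec):
    """Return the set of counts allowed by a MATCHLIMIT such as 5, 2-4 or 1,3,5."""
    allowed = set()
    for item in spec.split(","):
        if "-" in item:
            # Range form: both ends are inclusive.
            low, high = item.split("-", 1)
            allowed.update(range(int(low), int(high) + 1))
        else:
            # Single number form.
            allowed.add(int(item))
    return allowed


def parse_class_match(token):
    """Split a {CLASSTYPE:MATCHLIMIT} token into its class type and allowed counts."""
    class_type, _, limit = token.strip("{}").partition(":")
    return class_type, parse_matchlimit(limit)


print(parse_class_match("{NUMERIC:5}"))
print(parse_class_match("{NUMERIC:2-4}"))
print(parse_class_match("{NUMERIC:1,3,5}"))
```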
Punctuation can be specified for a match
Space matches any punctuation. For example, {NUMERIC:5} {NUMERIC:4} matches 94063-1352
Suffixes. For example, {ASCII:3:15}(BachStraße)
Output String:
  • BLANK – Leave Input String As Is
  • CONSTANT – e.g. (AV|AVE|AVENUE|AVENU|AVNUE)<TAB>AV(Alias)
  • CLASSMATCH
      • {<WORDNUMBER>[:<START POS>,<LENGTH>]}
      • For example, {NUMERIC:7}<TAB>{1:0,3}-{1:3,4} will output the string “1832715 2347901” as 183-2715
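The CLASSMATCH output syntax can be illustrated with a short Python sketch that applies an output string to a list of matched words. The function name render_output is hypothetical. Judging from the examples, the sketch assumes that word numbers start at 1 and start positions at 0.

```python
import re

# {<WORDNUMBER>} or {<WORDNUMBER>:<START POS>,<LENGTH>}
PLACEHOLDER_RE = re.compile(r"\{(\d+)(?::(\d+),(\d+))?\}")


def render_output(template, words):
    """Fill each placeholder in an output string from the matched words."""
    def substitute(match):
        # Assumption: word numbers are 1-based.
        word = words[int(match.group(1)) - 1]
        if match.group(2) is None:
            # No slice given: emit the whole word.
            return word
        # Assumption: start positions are 0-based.
        start, length = int(match.group(2)), int(match.group(3))
        return word[start:start + length]

    return PLACEHOLDER_RE.sub(substitute, template)


print(render_output("{1:0,3}-{1:3,4}", ["1832715"]))
```

With the matched word 1832715, the first placeholder takes three characters from position 0 and the second takes four characters from position 3, giving 183-2715.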
Type:
  • Component type of the interpretation
  • For example, ThoroughfareType, OrganizationType, Locality, PostalCodePrimary, etc.
  • Types are case sensitive
Confidence:
  • The confidence value ranges from 1 to 255
  • It indicates the likelihood that the interpretation is correct
  • For example, {NUMERIC:5}<TAB>{1}<TAB>PostalCodePrimary<TAB>250
  • A confidence of 0 (zero) is a special case that means Recode: replace an existing output string with a new one. The syntax is <OLD OUTPUT STRING><TAB><NEW OUTPUT STRING><TAB>0
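As a way to check a source file before importing it, the field layout described above could be validated with a Python sketch like the following. The names LexiconRule, Recode and parse_line are hypothetical and not part of the utility. The <TAB> placeholder is written as a real tab character ("\t").

```python
from typing import NamedTuple


class LexiconRule(NamedTuple):
    search: str
    output: str        # empty means "leave input string as is"
    word_type: str
    confidence: int


class Recode(NamedTuple):
    old_output: str
    new_output: str


def parse_line(line):
    """Parse one tab-delimited source line into a rule or a recode entry."""
    fields = line.rstrip("\n").split("\t")
    confidence = int(fields[-1])
    if confidence == 0:
        # Special case: <OLD OUTPUT STRING><TAB><NEW OUTPUT STRING><TAB>0
        if len(fields) != 3:
            raise ValueError("a recode entry needs exactly 3 fields")
        return Recode(fields[0], fields[1])
    if len(fields) != 4:
        raise ValueError("a lexicon rule needs exactly 4 fields")
    if not 1 <= confidence <= 255:
        raise ValueError("confidence must be between 1 and 255")
    return LexiconRule(fields[0], fields[1], fields[2], confidence)


print(parse_line("{NUMERIC:5}\t{1}\tPostalCodePrimary\t250"))
```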

Defining Macros

Macros allow developers to shortcut lexicon rules. Once a macro is defined, it can be used in defining a lexicon rule.

A macro definition is tab-delimited and contains the following fields:

  • Macro Hash Tag: #macro, which indicates the start of a macro definition
  • Macro Key String: <key>, the macro key enclosed in angle brackets
  • Expansion String: the string that automatically replaces the key when the macro is encountered in a lexicon rule entry
  • Type List (comma-separated): the list of types where the macro is valid


Macro Example

#macro<TAB><ST><TAB>(ST|SAINT|SANKT|SAN)<TAB>Locality,Thoroughfare

#macro<TAB><STREET><TAB>(STREET|ST|STRT)<TAB>Thoroughfare

The lexicon rule:

<ST> MARY <STREET><TAB>ST MARY ST<TAB>Thoroughfare<TAB>200

would be expanded to:

(ST|SAINT|SANKT|SAN) MARY (STREET|ST|STRT)<TAB>ST MARY ST<TAB>Thoroughfare<TAB>200
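
The expansion shown above can be reproduced with a small Python sketch. This illustrates the documented behavior and is not the utility's implementation. The names load_macros and expand_rule are hypothetical. The sketch assumes that macros are expanded only in the search string and only when the rule's type appears in the macro's type list.

```python
def load_macros(lines):
    """Collect #macro definitions as {key: (expansion, set_of_types)}."""
    macros = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 4 and fields[0] == "#macro":
            key, expansion, types = fields[1], fields[2], fields[3]
            macros[key] = (expansion, set(types.split(",")))
    return macros


def expand_rule(rule, macros):
    """Replace macro keys in the search string of one lexicon rule."""
    fields = rule.split("\t")
    search, word_type = fields[0], fields[2]
    for key, (expansion, types) in macros.items():
        # Assumption: a macro applies only to rules of a listed type.
        if word_type in types:
            search = search.replace(key, expansion)
    fields[0] = search
    return "\t".join(fields)


macros = load_macros([
    "#macro\t<ST>\t(ST|SAINT|SANKT|SAN)\tLocality,Thoroughfare",
    "#macro\t<STREET>\t(STREET|ST|STRT)\tThoroughfare",
])
print(expand_rule("<ST> MARY <STREET>\tST MARY ST\tThoroughfare\t200", macros))
```

Because each key includes its angle brackets, replacing <ST> does not touch <STREET>.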