LexiconImport
The LexiconImport utility builds Loqate lexicons. Running the utility with no parameters will display the following information:
Usage: lexiconimport <Source file> <Table name> <Unicode table path>
Source file format is tab-delimited UTF8 text containing:
- Search String
- Output String (if different from Search String)
- Word Type
- Confidence
Source File Syntax
This file is in TAB Delimited UTF-8 text format and contains the following fields:
For lexicon rule entry:
- Search String
- Output String
- Type
- Confidence
Search string format is as follows:
- The desired tokens to search for
- Enclose pipe-delimited aliases in parentheses
- For example STA(TIO|SHU)N = STATION & STASHUN or (FLAT|FLT|FL) = FLAT or FLT or FL
- Enclose a class match in braces ({ })
- Syntax is {CLASSTYPE:MATCHLIMIT}
- For example {NUMERIC:5} or {ASCII:1-5}
- CLASSTYPE
- NUMERIC = ([0-9] in all character sets)
- ALPHA = (ASCII + SPACELESS)
- ASCII = (LATIN [a-z|A-Z] + Diacritics(ž))
- SPACELESS = (CHINESE, JAPANESE, KOREAN, THAI etc.)
- NUMERICWORD = MATCHLIMIT refers to word count and not character count. For example, {NUMERICWORD:3-5} matches 168 766 4958 or 80 24 32 98 or 11 22 23 44 55
- ALPHAWORD = MATCHLIMIT refers to word count and not character count. For example, {ALPHAWORD:2-4} matches MARTIN LUTHER or MARTIN LUTHER KING or MARTIN LUTHER KING JUNIOR
- ASCIIWORD = MATCHLIMIT refers to word count and not character count. For example, {ASCIIWORD:2-3} matches Мартина Luther or Мартина Лютера Кинга in Russian.
- SPACELESSWORD = MATCHLIMIT refers to word count. However, each character is a word. For example, {SPACELESSWORD:2-3} matches マー or マーテ in Japanese.
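To make the alias syntax concrete, the following is a minimal sketch of how a parenthesized, pipe-delimited alias group expands into literal alternatives. It only illustrates the meaning of the syntax and is not the LexiconImport implementation; the function name expand_aliases is hypothetical, and the sketch assumes alias groups are not nested and ignores class matches in braces.

```python
import itertools
import re


def expand_aliases(pattern: str) -> list[str]:
    """Expand every (A|B|C) group in a search string into literal alternatives."""
    # Splitting on a capturing group keeps the group contents at the odd indices.
    parts = re.split(r"\(([^()]*)\)", pattern)
    choices = [
        part.split("|") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    # The Cartesian product of all choices yields every literal string.
    return ["".join(combo) for combo in itertools.product(*choices)]


print(expand_aliases("STA(TIO|SHU)N"))  # ['STATION', 'STASHUN']
print(expand_aliases("(FLAT|FLT|FL)"))  # ['FLAT', 'FLT', 'FL']
```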
Definition of Word:
- A set of Numeric or Alpha characters separated by Non-Numeric and Non-Alpha characters.
- Separated on Numeric/Alpha boundaries
- For example, FLAT1 (FLAT 1) or 94158US (94158 US)
- Each SPACELESS character is a word
- For example, マーテ (マ ー テ)
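The word definition above can be approximated with a regular expression. This is a rough sketch rather than the tokenizer the utility uses: the name split_words is hypothetical, and the Unicode ranges chosen for the SPACELESS class (Hiragana, Katakana, CJK ideographs, Thai, Hangul syllables) are an assumption made only for illustration.

```python
import re

# Assumption: these ranges stand in for the SPACELESS class
# (Hiragana/Katakana, CJK ideographs, Thai, Hangul syllables).
SPACELESS = r"\u3040-\u30ff\u4e00-\u9fff\u0e00-\u0e7f\uac00-\ud7af"

WORD_RE = re.compile(
    rf"[{SPACELESS}]"                 # each spaceless character is its own word
    r"|\d+"                           # a run of digits (any script)
    rf"|(?:(?![\d_{SPACELESS}])\w)+"  # a run of other letters
)


def split_words(text: str) -> list[str]:
    """Split text into words on punctuation and on numeric/alpha boundaries."""
    return WORD_RE.findall(text)


print(split_words("FLAT1"))       # ['FLAT', '1']
print(split_words("94158US"))     # ['94158', 'US']
print(split_words("マーテ"))       # ['マ', 'ー', 'テ']
print(split_words("94063-1352"))  # ['94063', '1352']
```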
MATCHLIMITS
- Single number. For example, {NUMERIC:5}
- Range. For example, {NUMERIC:2-4}
- Multiple numbers. For example, {NUMERIC:1,3,5}
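The three MATCHLIMIT forms can be read as sets of permitted counts. The sketch below shows one way to interpret them; parse_matchlimit and matches_numeric are hypothetical helper names, and accepting a mix of ranges and single numbers in one list (for example 1,3-5) is an assumption that the text above does not confirm.

```python
import re


def parse_matchlimit(spec: str) -> set[int]:
    """Turn '5', '2-4' or '1,3,5' into the set of permitted counts."""
    allowed: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            allowed.update(range(int(low), int(high) + 1))  # range is inclusive
        else:
            allowed.add(int(part))
    return allowed


def matches_numeric(text: str, spec: str) -> bool:
    """Check a candidate against {NUMERIC:<spec>} (character count of digits)."""
    return re.fullmatch(r"\d+", text) is not None and len(text) in parse_matchlimit(spec)


print(parse_matchlimit("5"))          # {5}
print(parse_matchlimit("2-4"))        # {2, 3, 4}
print(parse_matchlimit("1,3,5"))      # {1, 3, 5}
print(matches_numeric("94158", "5"))  # True
```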
Punctuation can be specified for a match
Space matches any punctuation. For example, {NUMERIC:5} {NUMERIC:4} matches 94063-1352
Suffixes. For example, {ASCII:3:15}(BachStraße)
Output String:
- BLANK – Leave Input String As Is
- CONSTANT – e.g. (AV|AVE|AVENUE|AVENU|AVNUE)<TAB>AV(Alias)
- CLASSMATCH
- {<WORDNUMBER>[:<START POS>,<LENGTH>]}
- For example, {NUMERIC:7}<TAB>{1:0,3}-{1:3,4} will output string “1832715 2347901” as 183-2715
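As an illustration of the CLASSMATCH output syntax, the sketch below applies an output template to a list of matched words. It is not the utility's own code: render_output is a hypothetical name, and the indexing (WORDNUMBER counted from 1, START POS counted from 0) is inferred from the 183-2715 example rather than stated by the source.

```python
import re

# {<WORDNUMBER>} or {<WORDNUMBER>:<START POS>,<LENGTH>}
PLACEHOLDER = re.compile(r"\{(\d+)(?::(\d+),(\d+))?\}")


def render_output(template: str, words: list[str]) -> str:
    """Fill each placeholder in template from the matched words."""

    def substitute(match: re.Match) -> str:
        word = words[int(match.group(1)) - 1]  # assumption: word numbers start at 1
        if match.group(2) is None:
            return word                        # {1} copies the whole word
        start, length = int(match.group(2)), int(match.group(3))
        return word[start:start + length]      # assumption: START POS starts at 0

    return PLACEHOLDER.sub(substitute, template)


print(render_output("{1:0,3}-{1:3,4}", ["1832715"]))  # 183-2715
print(render_output("{1}", ["94158"]))                # 94158
```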
Type:
- Component type of the interpretation
- For example, ThoroughfareType, OrganizationType, Locality, PostalCodePrimary etc.
- Types are case-sensitive
Confidence
- The confidence value ranges from 1 to 255
- It is an indicator of likelihood of correct interpretation
- For example, {NUMERIC:5}<TAB>{1}<TAB>PostalCodePrimary<TAB>250
- Confidence 0 (zero) is a special case that means Recode: the existing output string is replaced with a new output string. Syntax: <OLD OUTPUT STRING><TAB><NEW OUTPUT STRING><TAB>0
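Before running the import, a source file can be sanity-checked line by line. The following is a small sketch of such a check, not part of the Loqate tooling: parse_entry is a hypothetical helper, and the three-field layout it accepts for Recode entries is taken from the syntax example above.

```python
def parse_entry(line: str) -> dict:
    """Parse one tab-delimited source line into a rule or a recode entry."""
    fields = line.rstrip("\r\n").split("\t")
    confidence = int(fields[-1])  # raises ValueError if the last field is not a number

    if confidence == 0:
        # Recode entry: <OLD OUTPUT STRING><TAB><NEW OUTPUT STRING><TAB>0
        if len(fields) != 3:
            raise ValueError(f"recode entry needs 3 fields, got {len(fields)}")
        return {"kind": "recode", "old_output": fields[0], "new_output": fields[1]}

    if len(fields) != 4:
        raise ValueError(f"rule entry needs 4 fields, got {len(fields)}")
    if not 1 <= confidence <= 255:
        raise ValueError(f"confidence {confidence} is outside 1-255")

    search, output, word_type, _ = fields
    return {
        "kind": "rule",
        "search": search,
        "output": output,   # blank means the input string is left as is
        "type": word_type,  # case-sensitive component type
        "confidence": confidence,
    }


print(parse_entry("{NUMERIC:5}\t{1}\tPostalCodePrimary\t250"))
```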
Defining Macros
Macros allow developers to shortcut lexicon rules. Once a macro is defined, it can be used in defining a lexicon rule.
Syntax is TAB delimited and contains the following fields:
- Macro Hash Tag
- Macro Key String
- Expansion String
- Type List (comma-separated)
Macro Hash Tag
#macro indicates the start of a macro definition.
Macro Key String
<key>, enclosed in angle brackets.
Expansion String
When a macro is encountered in a lexicon rule entry, the key is automatically replaced by this string.
Type List
List of types where the macro is valid.
Macro Example
#macro<TAB><ST><TAB>(ST|SAINT|SANKT|SAN)<TAB>Locality,Thoroughfare
#macro<TAB><STREET><TAB>(STREET|ST|STRT)<TAB>Thoroughfare
The lexicon rule:
<ST> MARY <STREET><TAB>ST MARY ST<TAB>Thoroughfare<TAB>200
would get expanded to:
(ST|SAINT|SANKT|SAN) MARY (STREET|ST|STRT)<TAB>ST MARY ST<TAB>Thoroughfare<TAB>200
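The expansion shown above can be reproduced with a short sketch. It is an illustration under stated assumptions and not the utility's implementation: load_macros and expand_rule are hypothetical names, macros are assumed to be substituted only in the Search String field, and a macro whose Type List does not include the rule's type is simply left unexpanded.

```python
def load_macros(lines: list[str]) -> dict[str, tuple[str, set[str]]]:
    """Collect '#macro' definitions as key -> (expansion, valid types)."""
    macros = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if fields[0] != "#macro":
            continue
        _, key, expansion, type_list = fields
        macros[key.strip()] = (expansion, set(type_list.split(",")))
    return macros


def expand_rule(rule: str, macros: dict[str, tuple[str, set[str]]]) -> str:
    """Replace macro keys in the Search String of one lexicon rule."""
    fields = rule.split("\t")
    search, rule_type = fields[0], fields[2]
    for key, (expansion, types) in macros.items():
        if rule_type in types:  # assumption: only macros valid for this type apply
            search = search.replace(key, expansion)
    fields[0] = search
    return "\t".join(fields)


macros = load_macros([
    "#macro\t<ST>\t(ST|SAINT|SANKT|SAN)\tLocality,Thoroughfare",
    "#macro\t<STREET>\t(STREET|ST|STRT)\tThoroughfare",
])
print(expand_rule("<ST> MARY <STREET>\tST MARY ST\tThoroughfare\t200", macros))
```

Because the keys include their angle brackets, replacing <ST> does not disturb <STREET>.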