Overview

Package

Class

Tree

Deprecated

Index

Help

PREV NEXT

FRAMES NO FRAMES

Here are list of FREJ packages and guide on its usage.

See:
Description

Packages
frej	Includes all functionality for fuzzy regular expressions, based on fuzzy string matching/searching utilities from frej.fuzzy.
frej.fuzzy	Includes core functionality for fuzzy string matching or substring searching.
test	Includes simple test for debug purposes.

Here are list of FREJ packages and guide on its usage.

Fuzzy Regexp Syntax

New Fuzzy Regular Expression is created by instantiation of frej.Regex class. You need to pass "pattern" to constructor. This pattern describes what this regular expression should match. Here rules of pattern syntax are explained.

Common principles

Pattern consist of simple structural elements enclosed in brackets. Elements usually are nested - in such case we say that "parent" element encloses one or more "child" elements. There are six simple element types:

Token should be matched by similar sequence of letters;
Follow element lists child elements which should be matched in proper order;
Any element requires matching with any one of its children;
Both element have two children and requirres match both of them either in straight or reverse order;
Optional elment marks that its only child may be matched or skipped (which is better);
Numeric matches integer in specified range;
Regular matches token with standard regexp.

Also two referencing element types are present:

Subexpr is used to insert the same subexpression into several places of main expression;
Memory matches token with result of earlier matched group.

Additionally elements could specify substitution string for itself and group capturing mark to use match or substitution result in outer elements. This is thoroughly explained in Grouping and Substitutions.

Comments

In multiline patterns it is possible to use comments. Comment is started with double-slash and ended with line-feed, carriage-return or pattern end.

[^frej,(?test)] //sample pattern

Look further for multi-line examples.

Pattern example

Here follows sample view of pattern which distinguishes between three US presidents and provides some substitution for further processing.

[^ //President recognizer
    (barack, {?h*}, obama)
        | 44-th,
    (= george, washington)
        | 1-st,
    ( {^ abraham, abe}, lincoln)
        | 16-th
]~A
    | $A_president_\rof_U\_S\_A

Subexpressions

Pattern could contain more than one expression - in this case all except first are regarded as subexpressions. They should be marked with distinct names, prefixed by double-colon and could be referenced with Subexpr element:

//main expression
[^
    {Baker,(@street)},
    {^Lenina,(@street)}],
    {^Piccadilly,(?(@street))}
]

//subexpression definition
::street
(^st,street,lane)

References to subexpression should always precede definition of this subexpression in the pattern.

String processing

When Regex is instantiated and initialized by some pattern, it is ready for using against text strings by using one of methods:

match(String text) - compares "text" with pattern, supposing that pattern should fit "text" at whole;
matchFromStart(String text) - compares "text" with pattern, supposing that "text" possibly have extra tokens at the end, i.e. is longer than pattern requires (unused tail could be fetched after matching via suffix() method;
presentInSequence(String text) - searches in the "text" for a fragment, which is best match for pattern (unused beginning and ending of the text could be fetched after matching via prefix() and suffix() methods.

Tokenization
When one of matching methods is called, text, which should be compared with pattern, is broken up into separate tokens. All regexp elements, use the input string as an array of tokens, and try to match one or more tokens from this array. Tokens could be of following types:

sequence of letters;
sequence of digits;
punctuation mark;

Not all punctuation marks beget tokens - this could be set up by Regex.setAllowedPunctuationMarks(String marks). Each of allowed punctuation marks creates separate token even if marks follow each other. All other punctuation marks as well as spaces, linefeeds etc. are regarded as token separators.

For example, if we use (by default) "slash" and "dash" ('/' and '-') as allowed punctuation marks and try to process the line Baker street 221/B it is tokenized as [Baker, st, 221, /, B].

If then we apply to this text pattern like [ (^Baker,Taylor,Smith), st*, (#) ], then Baker token would match Baker-Taylor-Smith choice (element Any), street would match "st*" (element Token) and 221 would match Numeric element - you see that these three elements (Any, Token and Numeric) are linked together by outer brackets (i.e. Follow element) and because of this tokens which would be matched by them should immediately follow one another.

Result
Result of comparing text with pattern is evaluated as double value, which have meaning of difference between text and pattern. Value of 0.0 means exact match. Result of matching complex pattern against some text is evaluated by special rules explained below. All of matching methods decide that matching is successfull if result is greater than threshold (which is initially 0.34 and could be changed with help of Regex.setThreshold(double).

Pattern characters, escapes and special symbols

All pattern elements except Token are enclosed in brackets. To improve readability three types of brackets are allowed - (round), [square], and {curly}. There is no difference between them, though opening and closing brackets of the element should be of one type. For example:

{ (^ Baker, Taylor, Smith ), st*, [#] } is the same as a pattern above;
{[^Baker, Taylor, Smith],st*,(#)} is the same too, (spaces are skipped when processing pattern);
((^Baker, Taylor, Smith],st*,(#}) is incorrect because opening and closing bracket types do not match.

As it was mentioned, spaces, tabulations, line-feeds and carriage-returns in the pattern are skipped. It allows more convenient formatting of patterns, especially when specifying them in the file. Spaces are anyway token separators by default so they are useless in pattern. However, if pattern provides substitution strings, these strings could require spaces and other characters. Special symbols could be used in such cases.

Special symbols

Space could be specified by using underscore: (Baker,st)|Sherlock_Holmes would give substitution "Sherlock Holmes";
Underscores themselves could be specified by escaping them with backslash: (B\_a\_k\_e\_r,st) would match "B_a_k_e_r st";
line-feed and carriage-return could be specified by \n and \r;
brackets could be specified by escaping them with backslash, i.e. $, $, \[, \], \{, \};
round brackets could also be specified by \o and \c (opening and closing respectively) to improve readability;
Backslash is specified by escaping it with another backslash \\.

Element types description

Note: all elements except Token are enclosed in brackets.
Of those, which needs brackets all except Follow have one-character type mark immediately after opening bracket.
Elements which have several children should separate them with comma.

Token

This is the simplest element. It is specified by simple writing supposed token, as it should appear in text, if it is written correctly. Special ability is that if token element has a star symbol "*" at the end, it means that element may match only beginning of the token, skipping unused characters at end.

Token is matched in fuzzy way by calling Fuzzy.similarity(CharSequence text, CharSequence pattern).

Result of matching some token against "Token" element is the result returned by fuzzy matching method and is roughly equal to count of mistakes divided by average between length of pattern and length of text. Mistake is absence of symbol, extra symbol, replacement of one symbol with another and swap of two neighbor symbols.

Follow

This element is constructed simply by enclosing some child elements in brackets:

( child1, child2, ..., childN )

tokens of text being matched should match each of child elements in the sequentially in the same order. However if among child elements (or their children) there are "Optional" element, it (according to its design) could be skipped. For example (apple, beats, (?clean), dust):

matches aple bets claen duust;
matches appl beatz dust;
does not match appl never beatz dust;
does not match appl beatz wet dust.

Result of matching sequence of tokens against "Follow" element is such that:
- if any child yields result of 1.0 or greater, total result is 1.0;
- if all children yield the same result X, total result is equal to this value;
- total result is between best and worst of children results.
These rules have some "corrective" behavior - for example if one child gives result of 0.5 and other of 0.0 then their total would be close to 0.3 - i.e. if some child fits very well, some other is allowed to fit more poorly if when matched alone.

Exact rule of result evaluation is the following (similar to geometric mean):
E_total = 1 - ( (1 - E₁) * (1 - E₁) * ... * (1 - E_N) )^1/N
where E_i is the result of matching i-th child (truncated to 1.0 if it was greater).

Result evaluation became complicated if "Optional" element exists among children. In this case results for two variants are compared - including this skippable part and excluding it - and then best is chosen.

Any

This element is constructed by enclosing some children in brackets and putting "^" mark immediately after opening bracket:

(^ child1, child2, ..., childN )

each of children is tested against the same tokens and best matching children is chosen. If child elements contain substitution strings, then one of them would be returned as substitution for parent element, depending on which child was matched For example (apple|Jane, (banana, tree)|Mike, pear|Liza):

matches and converts aple to Jane;
matches and converts bonano three to Mike;
does not match pear tree (and returns nothing for substitution);

Result of matching against "Any" is equal to result of chosen child (i.e. best of child results).

When "Any" could be matched in several ways with the same result (for example if there is "Optional" among subchildren), then longest variant is selected (i.e. one, which grabs more tokens than others).

Both

This element is constructed by enclosing exactly two children in brackets and putting "=" mark immediately after opening bracket:

(= child1, child2 )

both children should be matched with tokens of input text sequentially, but in any order.

I.e. (=A,B) equals (^(A,B),(B,A)) and is evaluated and substituted just in this manner.

Optional

This element is constructed by enclosing other element (i.e. the only child) in brackets and putting "?" mark immediately after opening bracket:

(? child)

It is usable only inside other elements (like "Follow" or "Both"), where it allows some of children to be skipped, if this give better match result.

In cases when child element is not Token an alternative short syntax is allowed. Instead of writing, for example:

(?(^child1,child2))
(?(#M:N))
(?($X))

it is possible simply omit brackets around Optional's child:

(?^child1,child2)
(?#M:N)
(?$X)

Result is equal to result of child element, if it was not skipped. Otherwise result is not counted (i.e. if "Follow" element have N children and one of them was "Optional", and was skipped, then result of "Follow" is evaluated as if consisting of N-1 elements only).

Numeric

This element is constructed by putting "#" mark immediately after opening bracket and optionally specifying range of number to be matched.

(#) - mathces any token which represents integer value;
(#N) - matches only if integer is between 1 and N inclusively;
(#M:N) - matches only if integer is between M and N inclusively.

Result is equal to 0.0 if token is numeric and fits in specified range. Otherwise result is equal to 1.0

Such element type is convenient because it is usually not good idea to match numbers in fuzzy way, but often is important to check the range of number (when applied to enumerated floors, avenues, departments etc.)

Regular

This element is constructed by putting "!" (exclamation mark) immediately after opening bracket and specifying standard regular expression after it. Some special symbols should be escaped. Regular expression have syntax as described in java.util.regex.Pattern and matches the token using String.matches(String regex). Example:

(!\[A-Za-z\]\{3\})

matches token consisting of exactly 3 latin letters (like IBM or NaN).

Result is equal to 0.0 if token matches specified expression. Otherwise result is equal to 1.0

Subexpr

This element is constructed by putting "@" immediately after opening bracket and then specifying exact name of subexpression defined in the pattern (definition should be plased later in the pattern, not earlier). See subexpressions for an example of usage and of definition subexpressions themselves.

Result is just equal to result returned by subexpression.

Memory

This element is constructed by putting "$" immediately after opening bracket and then specifying the name which identifies some of marked groups.

((^white|black,red|cyan,green|magenta,blue|yellow)~COLOR,and,($COLOR))

Would match if pair of complementary colors specified, i.e. white and black or blue yellow. This element have sense mostly for complex substitutions.

Result is evaluated as if Token is matched using as a pattern the string, returned by previously matched group. I.e. if in our example red was matched and substituted with cyan in a group A, then memory element being matched against cian would return 0.25.

Grouping and Substitutions

Any pattern element could be followed by substitution string or grouping mark (or both, in the same order).

Grouping mark

Grouping is specified with "tilda" following name consisting of letters and digits. This makes engine remember the replacement result of the element (which now could be regarded as "group") in table of groups, so it could be later used in other matches or replacements.

Example of grouping marks usage (with no special purpose):

[ (^Anna, Bill, Dave)~NAME, (^Johnson, Smith)~LAST ]

Substitution string

For any element after it was matched, "replacement" or "substitution" could be retrieved. Normally for Tokens replacement equals to token pattern (if pattern have no "*" at end). In all other cases replacement equals to matched token of the text. For parent elements replacement is concatenation of children replacements.

For example if (Anna, J*) is matched against anne Jackson the replacement is equal to AnnaJackson.

Any element could specify other replacement than default with help of pipe symbol "|" after which replacement string follows. For example:

(Anna, J*|Wilson)

Now anne Jackson and Ann Johnson would have replacement "AnnaWilson"

Replacement string could include references to previously matched groups, specifying this with dollar sign $ followed by group name which is enclosed in brackets preceding dollar sign. Brackets could be omitted if name consist of only one character. In this case result of replacement of specified group would be inserted into replacement string:

[(^Anna, Elisa, Jessica|Jess)~A,(^Dowson,Jackson)~SURNAME] | Is_not_$A_($SURNAME)_beautiful?

Ann Douson => Is not Anna Dowson beautiful?
Jessica Iakson => Is not Jess Jackson beautiful?
William Wilson => (nothing)

There are three other useful referencing marks:

$$ inserts matched token (useful with Token element because it gives not matched token but pattern itself as replacement by default)
$< inserts prefix - useful in top-level expression when matching with presentInSequence;
$> inserts prefix - useful in top-level expression when matching with presentInSequence or matchFromStart;

Further description

This part of documentation is not completed right now. Please check "Guide and Examples" on FREJ site or wait a few days.