|
|||||||||
PREV NEXT | FRAMES NO FRAMES |
See:
Description
Packages | |
---|---|
frej | Includes all functionality for fuzzy regular expressions, based on fuzzy string matching/searching utilities from frej.fuzzy. |
frej.fuzzy | Includes core functionality for fuzzy string matching or substring searching. |
test | Includes simple test for debug purposes. |
Here are list of FREJ packages and guide on its usage.
New Fuzzy Regular Expression is created by instantiation of frej.Regex class. You need to pass "pattern" to constructor. This pattern describes what this regular expression should match. Here rules of pattern syntax are explained.
Pattern consist of simple structural elements enclosed in brackets. Elements usually are nested - in such case we say that "parent" element encloses one or more "child" elements. There are six simple element types:
Additionally elements could specify substitution string for itself and group capturing mark to use match or substitution result in outer elements. This is thoroughly explained in Grouping and Substitutions.
In multiline patterns it is possible to use comments. Comment is started
with double-slash and ended with line-feed, carriage-return or pattern end.
[^frej,(?test)] //sample patternLook further for multi-line examples.
Here follows sample view of pattern which distinguishes between three US presidents and provides some substitution for further processing.
[^ //President recognizer (barack, {?h*}, obama) | 44-th, (= george, washington) | 1-st, ( {^ abraham, abe}, lincoln) | 16-th ]~A | $A_president_\rof_U\_S\_A
Pattern could contain more than one expression - in this case all except first are regarded as subexpressions. They should be marked with distinct names, prefixed by double-colon and could be referenced with Subexpr element:
//main expression [^ {Baker,(@street)}, {^Lenina,(@street)}], {^Piccadilly,(?(@street))} ] //subexpression definition ::street (^st,street,lane)
References to subexpression should always precede definition of this subexpression in the pattern.
When Regex
is instantiated and initialized by some pattern,
it is ready for using against text strings by using one of methods:
match(String text)
- compares "text" with pattern,
supposing that pattern should fit "text" at whole;matchFromStart(String text)
- compares
"text" with pattern, supposing that "text" possibly have extra tokens at the
end, i.e. is longer than pattern requires (unused tail could be fetched after
matching via suffix()
method;presentInSequence(String text)
-
searches in the "text" for a fragment, which is best match for pattern (unused
beginning and ending of the text could be fetched after matching via
prefix()
and suffix()
methods.
Tokenization
When one of matching methods is called, text, which should be compared with
pattern, is broken up into separate tokens. All regexp elements, use the
input string as an array of tokens, and try to match one or more tokens from
this array. Tokens could be of following types:
Regex.setAllowedPunctuationMarks(String marks)
.
Each of allowed punctuation marks creates separate token even if marks follow
each other. All other punctuation marks as well as spaces, linefeeds etc.
are regarded as token separators.
For example, if we use (by default) "slash" and "dash" ('/' and '-') as allowed punctuation marks and try to process the line Baker street 221/B it is tokenized as [Baker, st, 221, /, B].
If then we apply to this text pattern like [ (^Baker,Taylor,Smith), st*, (#) ], then Baker token would match Baker-Taylor-Smith choice (element Any), street would match "st*" (element Token) and 221 would match Numeric element - you see that these three elements (Any, Token and Numeric) are linked together by outer brackets (i.e. Follow element) and because of this tokens which would be matched by them should immediately follow one another.
Result
Result of comparing text with pattern is evaluated as double value, which have meaning of
difference between text and pattern. Value of 0.0 means exact match. Result of matching complex
pattern against some text is evaluated by special rules explained below. All of matching methods
decide that matching is successfull if result is greater than threshold (which is initially
0.34 and could be changed with help of Regex.setThreshold(double)
.
All pattern elements except Token are enclosed in brackets. To improve readability three types of brackets are allowed - (round), [square], and {curly}. There is no difference between them, though opening and closing brackets of the element should be of one type. For example:
As it was mentioned, spaces, tabulations, line-feeds and carriage-returns in the pattern are skipped. It allows more convenient formatting of patterns, especially when specifying them in the file. Spaces are anyway token separators by default so they are useless in pattern. However, if pattern provides substitution strings, these strings could require spaces and other characters. Special symbols could be used in such cases.
Special symbols
Note: all elements except Token are enclosed in brackets.
Of those,
which needs brackets all except Follow have one-character type mark immediately after
opening bracket.
Elements which have several children should separate them with comma.
This is the simplest element. It is specified by simple writing supposed token, as it should appear in text, if it is written correctly. Special ability is that if token element has a star symbol "*" at the end, it means that element may match only beginning of the token, skipping unused characters at end.
Token is matched in fuzzy way by calling Fuzzy.similarity(CharSequence text, CharSequence pattern)
.
Result of matching some token against "Token" element is the result returned by fuzzy matching method and is roughly equal to count of mistakes divided by average between length of pattern and length of text. Mistake is absence of symbol, extra symbol, replacement of one symbol with another and swap of two neighbor symbols.
This element is constructed simply by enclosing some child elements in brackets:
( child1, child2, ..., childN )tokens of text being matched should match each of child elements in the sequentially in the same order. However if among child elements (or their children) there are "Optional" element, it (according to its design) could be skipped. For example (apple, beats, (?clean), dust):
Result of matching sequence of tokens against "Follow" element is such that:
- if any child yields result of 1.0 or greater, total result is 1.0;
- if all children yield the same result X, total result is equal to this value;
- total result is between best and worst of children results.
These rules have some "corrective" behavior - for example if one child gives result of 0.5 and other
of 0.0 then their total would be close to 0.3 - i.e. if some child fits very well, some other
is allowed to fit more poorly if when matched alone.
Exact rule of result evaluation is the following (similar to geometric mean):
Etotal = 1 - ( (1 - E1) * (1 - E1) * ... * (1 - EN) )1/N
where Ei is the result of matching i-th child (truncated to 1.0 if it was greater).
Result evaluation became complicated if "Optional" element exists among children. In this case results for two variants are compared - including this skippable part and excluding it - and then best is chosen.
This element is constructed by enclosing some children in brackets and putting "^" mark immediately
after opening bracket:
(^ child1, child2, ..., childN )each of children is tested against the same tokens and best matching children is chosen. If child elements contain substitution strings, then one of them would be returned as substitution for parent element, depending on which child was matched For example (apple|Jane, (banana, tree)|Mike, pear|Liza):
Result of matching against "Any" is equal to result of chosen child (i.e. best of child results).
When "Any" could be matched in several ways with the same result (for example if there is "Optional" among subchildren), then longest variant is selected (i.e. one, which grabs more tokens than others).
This element is constructed by enclosing exactly two children in brackets and putting "=" mark
immediately after opening bracket:
(= child1, child2 )both children should be matched with tokens of input text sequentially, but in any order.
I.e. (=A,B) equals (^(A,B),(B,A)) and is evaluated and substituted just in this manner.
This element is constructed by enclosing other element (i.e. the only child) in brackets and putting "?" mark immediately
after opening bracket:
(? child)It is usable only inside other elements (like "Follow" or "Both"), where it allows some of children to be skipped, if this give better match result.
In cases when child element is not Token an alternative short syntax is allowed. Instead of writing, for example:
(?(^child1,child2)) (?(#M:N)) (?($X))it is possible simply omit brackets around Optional's child:
(?^child1,child2) (?#M:N) (?$X)
Result is equal to result of child element, if it was not skipped. Otherwise result is not counted (i.e. if "Follow" element have N children and one of them was "Optional", and was skipped, then result of "Follow" is evaluated as if consisting of N-1 elements only).
This element is constructed by putting "#" mark immediately after opening bracket and optionally specifying range of number to be matched.
Result is equal to 0.0 if token is numeric and fits in specified range. Otherwise result is equal to 1.0
Such element type is convenient because it is usually not good idea to match numbers in fuzzy way, but often is important to check the range of number (when applied to enumerated floors, avenues, departments etc.)
This element is constructed by putting "!" (exclamation mark) immediately after opening
bracket and specifying standard regular expression after it. Some special symbols should be
escaped. Regular expression have syntax as described in java.util.regex.Pattern and matches the
token using String.matches(String regex). Example:
(!\[A-Za-z\]\{3\})matches token consisting of exactly 3 latin letters (like IBM or NaN).
Result is equal to 0.0 if token matches specified expression. Otherwise result is equal to 1.0
This element is constructed by putting "@" immediately after opening bracket and then specifying exact name of subexpression defined in the pattern (definition should be plased later in the pattern, not earlier). See subexpressions for an example of usage and of definition subexpressions themselves.
Result is just equal to result returned by subexpression.
This element is constructed by putting "$" immediately after opening bracket and then specifying
the name which identifies some of marked groups.
((^white|black,red|cyan,green|magenta,blue|yellow)~COLOR,and,($COLOR))Would match if pair of complementary colors specified, i.e. white and black or blue yellow. This element have sense mostly for complex substitutions.
Result is evaluated as if Token is matched using as a pattern the string, returned by previously matched group. I.e. if in our example red was matched and substituted with cyan in a group A, then memory element being matched against cian would return 0.25.
Any pattern element could be followed by substitution string or grouping mark (or both, in the same order).
Grouping is specified with "tilda" following name consisting of letters and digits. This makes engine remember the replacement result of the element (which now could be regarded as "group") in table of groups, so it could be later used in other matches or replacements.
Example of grouping marks usage (with no special purpose):
[ (^Anna, Bill, Dave)~NAME, (^Johnson, Smith)~LAST ]
For any element after it was matched, "replacement" or "substitution" could be retrieved. Normally for Tokens replacement equals to token pattern (if pattern have no "*" at end). In all other cases replacement equals to matched token of the text. For parent elements replacement is concatenation of children replacements.
For example if (Anna, J*) is matched against anne Jackson the replacement is equal to AnnaJackson.
Any element could specify other replacement than default with help of pipe symbol "|" after which replacement string follows. For example:
(Anna, J*|Wilson)Now anne Jackson and Ann Johnson would have replacement "AnnaWilson"
Replacement string could include references to previously matched groups, specifying this with dollar sign $ followed by group name which is enclosed in brackets preceding dollar sign. Brackets could be omitted if name consist of only one character. In this case result of replacement of specified group would be inserted into replacement string:
[(^Anna, Elisa, Jessica|Jess)~A,(^Dowson,Jackson)~SURNAME] | Is_not_$A_($SURNAME)_beautiful?Ann Douson => Is not Anna Dowson beautiful?
There are three other useful referencing marks:
prefix
- useful in
top-level expression when matching with
presentInSequence
;prefix
- useful in
top-level expression when matching with
presentInSequence
or
matchFromStart
;This part of documentation is not completed right now. Please check "Guide and Examples" on FREJ site or wait a few days.
|
|||||||||
PREV NEXT | FRAMES NO FRAMES |