Issue
Hi, I have a regex like this:
(.*(?=\sI+)*) (.*)
But it doesn't capture the groups the way I need.
For this example data:
- Vladimir Goth
- Langraab II Landgraab
- Léa Magdalena III Rouault Something
- Anna Maria Teodora
- Léa Maria Teodora II
Only examples 1 and 2 are captured correctly.
So what I need is
- If there is no I+, the string is split at the first space.
- If there are other words after I+, the first group should contain everything up to and including I+. So group 1 for the 3rd example should be Léa Magdalena III.
- If there are no other words after I+, as in example 5, group 1 should capture only up to the first space.
@Edit: I+ should be replaced by Roman numerals.
Solution
If you want to support any Roman numeral (this pattern covers values up to MMMMCMXCIX), you can use
^(\S+(?:.*\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\b(?= +\S))?) +(.*)
If you only need to support Roman numerals below XX (that is, I to XIX):
^(\S+(?:.*\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\b(?= +\S))?) +(.*)
See the regex demo #1 and demo #2. In Java code, replace the literal spaces with \h (horizontal whitespace) or \s (any whitespace), and double every backslash in the Java string literal.
Details:
- ^ - start of string
- ( - Group 1 start:
  - \S+ - one or more non-whitespaces
  - (?: - a non-capturing group:
    - .* - any zero or more chars other than line break chars, as many as possible
    - \b - a word boundary
    - (?=[MDCLXVI]) - require at least one Roman digit immediately to the right
    - M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}) - a Roman number pattern
    - \b - a word boundary
    - (?= +\S) - a positive lookahead that requires one or more spaces and then one non-whitespace right after the current position
  - )? - end of the non-capturing group, repeat one or zero times (it is optional)
- ) - end of the first group
-  + (a literal space followed by +) - one or more spaces
- (.*) - Group 2: the rest of the line.
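A minimal sketch of why the (?=[MDCLXVI]) lookahead is there: every part of the Roman number pattern is optional, so on its own it also matches the empty string. The class and method names below (RomanTokenCheck, isRoman, matchesUnguarded) are made up for illustration and are not part of the answer.

```java
import java.util.regex.Pattern;

class RomanTokenCheck {
    // The Roman number sub-pattern from the answer, without the lookahead guard.
    static final String ROMAN_BODY =
        "M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})";

    // Every piece of ROMAN_BODY may match nothing, so this also accepts "".
    static final Pattern UNGUARDED = Pattern.compile(ROMAN_BODY);

    // The lookahead forces at least one Roman digit to be present.
    static final Pattern GUARDED = Pattern.compile("(?=[MDCLXVI])" + ROMAN_BODY);

    // Hypothetical helper: true if the whole token is a Roman numeral.
    static boolean isRoman(String token) {
        return GUARDED.matcher(token).matches();
    }

    // Hypothetical helper: same check without the guard, for comparison.
    static boolean matchesUnguarded(String token) {
        return UNGUARDED.matcher(token).matches();
    }
}
```

With this sketch, matchesUnguarded("") is true while isRoman("") is false. Tokens such as III, XIV and MCMXCIV pass isRoman, and Maria does not, because only its leading M fits the pattern.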
In Java:
String regex = "^(\\S+(?:.*\\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\\b(?=\\h+\\S))?)\\h+(.*)";
// Or, for Roman numerals below XX only:
String regex = "^(\\S+(?:.*\\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\\b(?=\\s+\\S))?)\\s+(.*)";
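To check the pattern against the sample data from the question, something like the following could be used. It is only a sketch: the class name RomanNameSplitter and the split helper are hypothetical, and it assumes the \h+ variant of the "any Roman numeral" pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RomanNameSplitter {
    // The "any Roman numeral" pattern, with \h+ in place of the literal spaces.
    private static final Pattern NAME = Pattern.compile(
        "^(\\S+(?:.*\\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})"
        + "(?:IX|IV|V?I{0,3})\\b(?=\\h+\\S))?)\\h+(.*)");

    // Hypothetical helper: returns {group 1, group 2}, or null if the line does not match
    // (for example, a single word with no whitespace to split on).
    static String[] split(String line) {
        Matcher m = NAME.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] samples = {
            "Vladimir Goth",
            "Langraab II Landgraab",
            "Léa Magdalena III Rouault Something",
            "Anna Maria Teodora",
            "Léa Maria Teodora II"
        };
        for (String s : samples) {
            String[] parts = split(s);
            System.out.println(parts == null
                ? s + " -> no match"
                : "[" + parts[0] + "] [" + parts[1] + "]");
        }
    }
}
```

For the five samples this should give the splits Vladimir / Goth, Langraab II / Landgraab, Léa Magdalena III / Rouault Something, Anna / Maria Teodora, and Léa / Maria Teodora II. Note that Maria is not taken for a numeral: the \b after the Roman number pattern fails when the M is followed by more letters.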
Answered By - Wiktor Stribiżew