Carp Regular Expression Syntax

Carriage returns, linefeeds, and tabs are discarded. Spaces are preserved. The basic unit of regular expression syntax is the item. Examples of items are a literal character, such as z, and a character set, such as [0-9a-fA-F]. Modifiers apply to the previous item. For example, + means that the previous item must match one or more times.
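Because carriage returns, linefeeds, and tabs are discarded, a long expression can be laid out over several indented lines. The Python sketch below is not part of Carp; the hypothetical helper strip_layout only mimics the discarding rule described above so that its effect can be seen.

```python
# Hypothetical helper (not part of Carp): mimics the rule that carriage
# returns, linefeeds, and tabs are discarded while spaces are preserved.
DISCARDED = {ord("\r"): None, ord("\n"): None, ord("\t"): None}


def strip_layout(pattern: str) -> str:
    """Return the pattern with CR, LF, and tab characters removed."""
    return pattern.translate(DISCARDED)


# A pattern laid out over several lines, with a tab used for indentation.
laid_out = "[0-9a-fA-F]+\n\t( [0-9a-fA-F]+)*\r\n"
print(strip_layout(laid_out))  # [0-9a-fA-F]+( [0-9a-fA-F]+)*
```

The space inside the group survives, so it still has to match a space in the subject.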

Literal Items

Any character except *, +, ?, {, }, [, ], (, ), \, <, >, ^, $, |, . (period), carriage return, linefeed, and tab is a literal and matches itself.

Escape Items

\ begins an escape. \num is an octal character escape. \xnum is a hexadecimal character escape. \cchar is a control character. Otherwise, if the character following the \ is one of *, +, ?, {, }, [, ], \, <, >, (, ), ^, $, . (period), or |, the escape stands for that character itself.
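When a program builds an expression from arbitrary text, each special character has to be escaped. The following is a minimal Python sketch that assumes only the rules stated above; carp_quote is a hypothetical helper, not a function provided by Carp.

```python
# Characters listed above as special; each may be written as "\" + character.
SPECIAL = set("*+?{}[]()\\<>^$|.")


def carp_quote(text: str) -> str:
    """Hypothetical helper: escape text so every character matches literally."""
    out = []
    for ch in text:
        if ch in "\r\n\t":
            # These characters are discarded from an expression, so this
            # sketch refuses them rather than guess at a spelling.
            raise ValueError("CR, LF, and tab need an escape or extended item")
        out.append("\\" + ch if ch in SPECIAL else ch)
    return "".join(out)


print(carp_quote("a+b (c)"))  # a\+b \(c\)
```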

Dot

Dot (period) matches any character except for a newline.

Character Classes

[ begins a character class. [^ begins a negated character class. Within a character class, - denotes a range that includes both its beginning and its end, except when it is the first character of the class. \ indicates an escape, using the same syntax as escape items.
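Python's re module uses a similar bracket notation, so the behaviour of ranges, negation, and escapes inside a class can be tried there. This is only an analogy; Carp may differ in details that are not stated here.

```python
import re

# Ranges include both endpoints: 0 and 9, a and f, A and F all match.
hex_digits = re.compile(r"[0-9a-fA-F]+")
print(hex_digits.findall("ff 0G 1a"))  # ['ff', '0', '1a']

# A negated class matches any character that is not listed.
not_digits = re.compile(r"[^0-9]+")
print(not_digits.findall("ab12cd"))  # ['ab', 'cd']

# Escapes work inside a class; \x41-\x43 is the range A to C.
print(bool(re.fullmatch(r"[\x41-\x43]+", "ABC")))  # True
```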

Grouping

( and ) are used to group a sequence of one or more items together into a single item. They have no other special meaning.

Alternatives

| indicates alternatives. It must be placed between two sequences of items.

Anchors

^ matches the beginning of a line; this can be at the beginning of the subject or just after a newline. $ matches at the end of a line; this can be at the end of the subject or just before a newline. <beginningofword> matches at the beginning of a word and <endofword> matches at the end of a word. When an anchor matches, it does not consume any of the subject.
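The positions at which these anchors match can be illustrated with Python's re module. This is an analogy rather than Carp code: re.MULTILINE is assumed to correspond to the line behaviour described above, and the two word-boundary patterns are assumed approximations of <beginningofword> and <endofword>.

```python
import re

subject = "first line\nsecond line"


def positions(pattern: str) -> list:
    """Offsets at which a zero-width pattern matches in the subject."""
    return [m.start() for m in re.finditer(pattern, subject, re.MULTILINE)]


line_starts = positions(r"^")         # start of subject, and just after a newline
line_ends = positions(r"$")           # just before a newline, and end of subject
word_starts = positions(r"\b(?=\w)")  # boundary followed by a word character
word_ends = positions(r"\b(?<=\w)")   # boundary preceded by a word character

print(line_starts)  # [0, 11]
print(line_ends)    # [10, 22]
print(word_starts)  # [0, 6, 11, 18]
print(word_ends)    # [5, 10, 17, 22]
```

Every match is empty, which is what it means for an anchor not to consume any of the subject.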

Extended Items

< begins an extended item and > ends the extended item. The names of extended items are case insensitive.

All Unicode general category values and groups of values are supported as extended items. Either the abbreviation or the name can be used. An item can be negated using not. For example, all letters would be <L> or <Letter>, which is equivalent to <Lu>|<Ll>|<Lt>|<Lm>|<Lo>. Another example: everything which is not a number would be <not,Number>. See www.unicode.org for a description of the categories.
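The relation between a group such as <L> and its member categories can be explored with Python's unicodedata module, which reports the two-letter general category of a character. The helper in_category is hypothetical and handles abbreviations only, not the long names.

```python
import unicodedata


def in_category(ch: str, abbreviation: str) -> bool:
    """True if the general category of ch equals or begins with abbreviation."""
    return unicodedata.category(ch).startswith(abbreviation)


# <L> behaves like <Lu>|<Ll>|<Lt>|<Lm>|<Lo>: any category beginning with "L".
letters = [in_category(c, "L") for c in "a\u00c95 "]
print(letters)  # [True, True, False, False]

# <not,Number>: every character whose category does not begin with "N".
# \u2167 is ROMAN NUMERAL EIGHT, category Nl.
not_numbers = [not in_category(c, "N") for c in "a5\u2167"]
print(not_numbers)  # [True, False, False]
```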

<tab> is a tab. <newline> is a newline. <nl> is a newline. <wordchar> matches one character which has word syntax; by default this is equivalent to <Letter>|<Mark>|<Number>. <spacechar> matches one character which has space syntax; this is everything which does not have word syntax.
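The default definitions of <wordchar> and <spacechar> can be approximated in Python as follows. Both function names are hypothetical, and the sketch assumes the default word syntax given above.

```python
import unicodedata


def is_wordchar(ch: str) -> bool:
    """Approximates the default <wordchar>: <Letter>|<Mark>|<Number>."""
    return unicodedata.category(ch)[0] in "LMN"


def is_spacechar(ch: str) -> bool:
    """Approximates <spacechar>: everything that does not have word syntax."""
    return not is_wordchar(ch)


# "\u0301" is a combining acute accent (category Mn), so it has word syntax.
print([is_wordchar(c) for c in "a7\u0301"])  # [True, True, True]
print([is_spacechar(c) for c in " -_"])      # [True, True, True]
```

Under this default the underscore (category Pc) has space syntax, unlike \w in Python's re module, which treats it as a word character.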

Simple Modifiers

*, +, and ? are the simple modifiers. * means that the previous item must be greedily matched zero or more times. + means that the previous item must be greedily matched one or more times. ? means that the previous item must be greedily matched zero or one times.

Extended Modifiers

{num} means that the previous item must be matched num times. {min,} means that the previous item must be greedily matched min or more times. {min,max} means that the previous item must be greedily matched between min and max times.
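The three counted forms have direct counterparts in Python's re module, which makes the greedy behaviour easy to observe. The snippet is an analogy, not Carp code.

```python
import re

# {num}: exactly num repetitions.
exactly_three = re.fullmatch(r"a{3}", "aaa") is not None
too_many = re.fullmatch(r"a{3}", "aaaa") is not None
print(exactly_three, too_many)  # True False

# {min,}: min or more; being greedy, it takes the whole run.
at_least_two = re.match(r"a{2,}", "aaaaa").group()
print(at_least_two)  # aaaaa

# {min,max}: between min and max; being greedy, it takes max when available.
two_to_four = re.match(r"a{2,4}", "aaaaa").group()
print(two_to_four)  # aaaa
```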

{greedy,min,} means that the previous item must be greedily matched min or more times. {greedy,min,max} means that the previous item must be greedily matched between min and max times.

{lazy,min,} means that the previous item must be lazily matched min or more times. {lazy,min,max} means that the previous item must be lazily matched between min and max times.

{keep,min,} means that the previous item must be matched min or more times, as many times as possible and without backtracking. {keep,min,max} means that the previous item must be matched between min and max times, as many times as possible and without backtracking.

Lazy vs Greedy Matching

Greedy matching starts out by matching as many times as possible and then backs off, one repetition at a time, to try to complete the match. Lazy matching starts out by matching as few times as possible and then gradually matches more to complete the match.
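The difference shows up whenever more than one overall match is possible. In Python's re module, .* is greedy and .*? is lazy; in the syntax described above these would presumably be written .{greedy,0,} and .{lazy,0,}. The Python version is shown because it can be run directly.

```python
import re

subject = "a1b2b"

# Greedy: take everything, then back off until the final b can match.
greedy = re.match(r"a.*b", subject).group()

# Lazy: take nothing, then extend until the final b can match.
lazy = re.match(r"a.*?b", subject).group()

print(greedy)  # a1b2b
print(lazy)    # a1b
```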

Keep vs Greedy Matching

Keep matches differ from greedy matches in that they do not backtrack: a keep either matches all of a sequence or fails, whereas a greedy match can give back part of the sequence. For example, given the string aaaaaa, the regular expression a{greedy,0,}a would match and a{keep,0,}a would fail, because the keep consumes every a and leaves none for the final a.
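The aaaaaa example can be reproduced in Python. Keep-style matching is emulated here with a lookahead followed by a backreference, a common technique for getting a repetition that cannot backtrack; it is a stand-in for {keep,0,}, not Carp's implementation.

```python
import re

subject = "aaaaaa"

# Greedy: a* first takes all six characters, then gives one back so that
# the final a can match.
greedy = re.match(r"a*a", subject)

# Keep-like: the lookahead captures the longest run of a's, and \1 then
# consumes exactly that run with no way to give any of it back.
keep_like = re.match(r"(?=(a*))\1a", subject)

print(greedy is not None)     # True
print(keep_like is not None)  # False
```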

Lookahead

Positive lookahead is indicated using {lookahead,positive} and negative lookahead is indicated using {lookahead,negative}.
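The general idea of lookahead can be illustrated with Python's re module, where (?=...) is positive and (?!...) is negative. This shows the concept only; how {lookahead,positive} and {lookahead,negative} attach to items in a Carp expression is not assumed here.

```python
import re

# Positive lookahead: match "foo" only when "bar" follows. The lookahead
# consumes nothing, so the match ends right after "foo".
positive = re.search(r"foo(?=bar)", "foobar")
print(positive.group(), positive.end())  # foo 3

# Negative lookahead: match "foo" only when "bar" does not follow.
rejected = re.search(r"foo(?!bar)", "foobar")
accepted = re.search(r"foo(?!bar)", "foobaz")
print(rejected)          # None
print(accepted.group())  # foo
```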

Capturing and Back Referencing

{capture,name} indicates that the previous item should be captured such that it can be referenced using name. <backref,name> can be used to refer to the contents of a capture. A backref is an item. A capture is a modifier. The contents of a capture can also be used after a match has succeeded. This allows a program to use the pieces of a match.
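Named captures and back references work along the same lines in Python's re module, shown here as an analogy. Following the rules above, the Carp spelling of this pattern would presumably be ([a-z]+){capture,word} <backref,word>, but that form is an assumption and has not been checked against the library.

```python
import re

# (?P<word>...) names a capture; (?P=word) refers back to what it matched.
doubled = re.compile(r"(?P<word>[a-z]+) (?P=word)")

m = doubled.search("it was the the best")
print(m.group())        # the the
# After a successful match, the program can use the captured piece.
print(m.group("word"))  # the
```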