Carp C++ Interface

Carp is a C++ library. To use it, include carp.h in your C++ source file. Link with either the debug or release version of carp.lib. For debug builds, define the preprocessor variable CARP_DEBUG before you include carp.h.

Carp supports Unicode, both UTF8 and UTF16. Internally it uses UTF16. CarpWCh is used for the UTF16 encoding and CarpUCh is used for the UTF8 encoding.

typedef unsigned short CarpWCh;
typedef unsigned char CarpUCh;

A CarpRegEx is a compiled regular expression. In a debug build, Dump returns a null-terminated string containing a human-readable representation of the compiled regular expression. Once you are done with the string, release it with free.

class CarpRegEx
{
public:

#ifdef CARP_DEBUG
    virtual char * Dump(void) = 0;
#endif // CARP_DEBUG
};

A CarpError contains information about any error which occurs while compiling a regular expression. Offset is the number of characters into the source string at which the error occurred. Index CarpErrorMessages with Error to obtain a human-readable error message.

struct CarpError
{
    unsigned int Error;
    unsigned int Offset;
};

extern char * CarpErrorMessages[];
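
One way to report a compile error is to look up the message and point at the offending offset. The sketch below is self-contained: CarpError is copied from the declaration above, while SampleMessages and DescribeError are hypothetical names invented for illustration. The real table is CarpErrorMessages, whose contents are not listed in this document.

```cpp
#include <string>

// Copied from the declaration above so the sketch compiles without carp.h.
struct CarpError
{
    unsigned int Error;
    unsigned int Offset;
};

// Hypothetical stand-in for CarpErrorMessages; the real strings live in the
// library and are not shown in this document.
static const char * SampleMessages[] =
{
    "no error",
    "unbalanced parenthesis",
};

// Hypothetical helper: builds a report with a caret under the character at
// ce.Offset. Assumes one display column per character in src.
std::string DescribeError(const std::string & src, const CarpError & ce,
        const char * const * messages)
{
    std::string out = "error: ";
    out += messages[ce.Error];
    out += "\n";
    out += src;
    out += "\n";
    out += std::string(ce.Offset, ' ');
    out += "^";
    return out;
}
```

With the real library, the third argument would presumably be CarpErrorMessages itself.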

Specify CARP_NOOPTIMIZE to turn off optimization when compiling a regular expression.

#define CARP_NOOPTIMIZE      0x00000001

By default, . (dot or period) matches any character except for a newline. Specifying CARP_DOTALL will cause dot to match any character.

#define CARP_DOTALL          0x00000002

Finally, in a debug build, specify CARP_DEBUGOUTPUT to write additional debug information to stdout while compiling a regular expression.

#ifdef CARP_DEBUG
#define CARP_DEBUGOUTPUT     0x10000000
#endif // CARP_DEBUG
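
Since each flag occupies its own bit, the flags can presumably be combined with bitwise OR to form the cf argument. A minimal sketch, with the two unconditional defines copied from above and MakeCompileFlags as a hypothetical helper name:

```cpp
// Copied from the definitions above so the sketch compiles without carp.h.
#define CARP_NOOPTIMIZE      0x00000001
#define CARP_DOTALL          0x00000002

// Hypothetical helper: builds the cf argument from individual choices.
unsigned int MakeCompileFlags(bool optimize, bool dotAll, bool debugOutput)
{
    unsigned int cf = 0;
    if (!optimize)
        cf |= CARP_NOOPTIMIZE;
    if (dotAll)
        cf |= CARP_DOTALL;
#ifdef CARP_DEBUGOUTPUT
    // Only available when CARP_DEBUG was defined before including carp.h.
    if (debugOutput)
        cf |= CARP_DEBUGOUTPUT;
#else
    (void) debugOutput;   // flag does not exist in release builds
#endif
    return cf;
}
```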

CarpParse::CompileRegExU and CarpParse::CompileRegExW compile a regular expression into a CarpRegEx. CompileRegExU takes a UTF8 string and CompileRegExW takes a UTF16 string. s is the regular expression and sl is its length. cf is zero or more of the flags described above. On error, zero is returned and ce describes the error.

class CarpParse
{
public:

    virtual CarpRegEx * CompileRegExU(CarpUCh * s, unsigned int sl,
            CarpError * ce, unsigned int cf) = 0;
    virtual CarpRegEx * CompileRegExW(CarpWCh * s, unsigned int sl,
            CarpError * ce, unsigned int cf) = 0;

CarpParse::EscapeLiteral escapes the individual characters in a literal string so that, when compiled as a regular expression, it matches that string literally. For example, if a.*b is passed in, it will be escaped to a\.\*b. Again, EscapeLiteralU is for UTF8 strings and EscapeLiteralW is for UTF16 strings.

s is the string to be escaped and sl is its length. r is where the escaped string should be placed; it needs to be at least twice as large as s. The return value is the length of the escaped string.

    virtual unsigned int EscapeLiteralU(CarpUCh * s, unsigned int sl,
            CarpUCh * r) = 0;
    virtual unsigned int EscapeLiteralW(CarpWCh * s, unsigned int sl,
            CarpWCh * r) = 0;
};
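
The requirement that r be twice as large as s follows from the worst case, where every character gains a backslash. The stand-alone sketch below illustrates this; it is not Carp's implementation. SketchEscapeLiteralU is a hypothetical name, and the set of escaped characters is an assumption (only . and * are confirmed by the example above).

```cpp
#include <cstring>

typedef unsigned char CarpUCh;   // as declared above

// Hypothetical stand-in for EscapeLiteralU, NOT Carp's implementation.
// The metacharacter set is an assumption made for illustration.
unsigned int SketchEscapeLiteralU(const CarpUCh * s, unsigned int sl,
        CarpUCh * r)
{
    static const char meta[] = "\\.*+?()[]{}|^$";
    unsigned int rl = 0;
    for (unsigned int i = 0; i < sl; i++)
    {
        // Worst case: every input character is a metacharacter, so the
        // output reaches 2 * sl. The zero check keeps strchr from matching
        // the terminator of meta.
        if (s[i] != 0 && std::strchr(meta, s[i]) != 0)
            r[rl++] = '\\';
        r[rl++] = s[i];
    }
    return rl;
}
```

A caller would size the destination as 2 * sl elements before the call and use the return value as the length of the escaped string.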

NewCarpParse is used to create a new parser. DeleteCarpParse is used to delete a parser once you are done with it and DeleteCarpRegEx is used to delete a compiled regular expression.

CarpParse * NewCarpParse(void);
void DeleteCarpParse(CarpParse * cp);
void DeleteCarpRegEx(CarpRegEx * cre);

A CarpMatch matches subjects against a compiled regular expression. NewCarpMatch creates a CarpMatch object and DeleteCarpMatch deletes it once you are done with it.

CarpMatch * NewCarpMatch(CarpRegEx * cre);
void DeleteCarpMatch(CarpMatch * cm);

Matching forward is done using CarpMatch::MatchForward and matching backward is done using CarpMatch::MatchBackward. For both methods, cs specifies the subject of the match, idx is where to start matching from, et specifies the end-of-line type, and clf specifies whether the match ignores case. If the match is successful, the return value is where the match occurred and *ml is set to the length of the match. Otherwise, MATCH_FAILED is returned.

class CarpMatch
{
public:

    virtual unsigned int MatchForward(CarpSubject * cs, unsigned int idx,
            unsigned int * ml, unsigned int et, unsigned int clf) = 0;
    virtual unsigned int MatchBackward(CarpSubject * cs, unsigned int idx,
            unsigned int * ml, unsigned int et, unsigned int clf) = 0;

Carp supports three different newline sequences: CRLF, CR, and LF. Pass one of the following values as et.

#define CARP_EOLCRLF 0
#define CARP_EOLCR 1
#define CARP_EOLLF 2

Information about any captures can be obtained after a successful match using the following methods.

    virtual unsigned int GetNumCaptures(void) = 0;
    virtual CarpUCh * GetCaptureNameU(unsigned int idx,
            unsigned int * len) = 0;
    virtual CarpWCh * GetCaptureNameW(unsigned int idx,
            unsigned int * len) = 0;
    virtual CarpUCh * GetCaptureU(unsigned int idx, unsigned int * len) = 0;
    virtual CarpWCh * GetCaptureW(unsigned int idx, unsigned int * len) = 0;
    virtual unsigned int GetCaptureOffset(unsigned int idx) = 0;
};

CarpSubject specifies the subject for a match; you will need to provide your own implementation of this class to pass to MatchForward and MatchBackward.

class CarpSubject
{
public:

If a subject uses a character set which is a single byte in width and can be mapped directly to Unicode, GetMap should return a 256-element array containing the map. Otherwise, CARP_UTF16MAP should be returned for a UTF16 subject and CARP_UTF8MAP should be returned for a UTF8 subject.

    virtual unsigned short * GetMap(void) = 0;

If a subject needs a different notion of syntax than the default, you will need to return an implementation of CarpSyntax. By default, all characters in the letter, number, and mark Unicode categories are treated as part of a word and everything else is treated as a space.

    virtual CarpSyntax * GetSyntax(void) = 0;

Return the subject here. If you returned CARP_UTF16MAP from GetMap, the subject must be UTF16; if you returned CARP_UTF8MAP, it must be UTF8; otherwise, it will be mapped to UTF16 using the array you returned. Set sl to the length of the subject: in characters for UTF16, otherwise in bytes.

    virtual void * GetSubject(unsigned int * sl) = 0;

If the end of the subject is reached, Carp will ask you to grow it. If you can, return a new subject and length. Otherwise, return zero.

    virtual void * GrowSubject(unsigned int * sl) = 0;

Return nonzero from NotAtBeginning if the start of the subject is not actually the beginning of the text, and nonzero from NotAtEnd if the end of the subject is not actually the end of the text. This controls whether or not ^ and $ match at the beginning and end of the subject respectively.

    virtual unsigned int NotAtBeginning(void) = 0;
    virtual unsigned int NotAtEnd(void) = 0;
};
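
As an illustration, a subject that wraps a complete, fixed UTF16 buffer might look like the sketch below. The interface declarations are copied from this document so the block compiles without carp.h. BufferSubject is a hypothetical name, the CARP_UTF16MAP value is a placeholder (the real definition comes from carp.h), and returning zero from GetSyntax to select the default syntax is an assumption that this document does not confirm.

```cpp
typedef unsigned short CarpWCh;   // as declared above

// Interface declarations copied from this document; in real code,
// include carp.h instead.
class CarpSyntax
{
public:
    virtual unsigned int WordSyntax(CarpWCh ch) = 0;
};

class CarpSubject
{
public:
    virtual unsigned short * GetMap(void) = 0;
    virtual CarpSyntax * GetSyntax(void) = 0;
    virtual void * GetSubject(unsigned int * sl) = 0;
    virtual void * GrowSubject(unsigned int * sl) = 0;
    virtual unsigned int NotAtBeginning(void) = 0;
    virtual unsigned int NotAtEnd(void) = 0;
};

// Placeholder only: carp.h defines the real CARP_UTF16MAP, and its value
// is not shown in this document.
#ifndef CARP_UTF16MAP
#define CARP_UTF16MAP ((unsigned short *) 0)
#endif

// Hypothetical subject wrapping a complete, fixed UTF16 buffer.
class BufferSubject : public CarpSubject
{
public:
    BufferSubject(CarpWCh * text, unsigned int len)
        : m_text(text), m_len(len) {}

    unsigned short * GetMap(void) { return CARP_UTF16MAP; }

    // Assumption: returning zero selects the default word syntax.
    CarpSyntax * GetSyntax(void) { return 0; }

    // Length is in characters because the subject is UTF16.
    void * GetSubject(unsigned int * sl) { *sl = m_len; return m_text; }

    // The buffer is fixed, so it can never grow.
    void * GrowSubject(unsigned int *) { return 0; }

    // The buffer holds the whole text, so ^ and $ may match at its edges.
    unsigned int NotAtBeginning(void) { return 0; }
    unsigned int NotAtEnd(void) { return 0; }

private:
    CarpWCh * m_text;
    unsigned int m_len;
};
```

A subject backed by a file or an editor buffer would instead return more text from GrowSubject and report nonzero from NotAtBeginning or NotAtEnd when it exposes only a window of the text.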

You will need to supply an implementation of CarpSyntax if you want to control which characters are parts of words and which are not. Return nonzero from WordSyntax if the character is part of a word and zero otherwise.

class CarpSyntax
{
public:

    virtual unsigned int WordSyntax(CarpWCh ch) = 0;
};
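
For example, a syntax for C-like identifiers could treat only ASCII letters, digits, and the underscore as word characters. This is a minimal sketch: the typedef and the CarpSyntax declaration are copied from above, and IdentifierSyntax is a hypothetical name.

```cpp
typedef unsigned short CarpWCh;   // as declared above

class CarpSyntax   // copied from the declaration above
{
public:
    virtual unsigned int WordSyntax(CarpWCh ch) = 0;
};

// Hypothetical syntax for C-like identifiers: ASCII letters, digits, and
// underscore are part of a word; everything else is not.
class IdentifierSyntax : public CarpSyntax
{
public:
    unsigned int WordSyntax(CarpWCh ch)
    {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
                || (ch >= '0' && ch <= '9') || ch == '_';
    }
};
```

An instance of such a class would be returned from CarpSubject::GetSyntax for subjects that need it.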