The Trials and Tribulations of RegEx’ing and Other Stories…

March 4, 2017

RegEx’ing C++

So I thought I was on to something using expandable macros to create complex regular expressions. I managed to define an expression that recognised C++ class definitions and matched these throughout all my C++ code (I still had to write all the DUnit tests).

So what am I talking about? In a previous post (C++ Books and Possible Insanity (RegEx’ing C++)…) I thought of an alternate approach to creating regular expressions using expandable macros, so that the regular expression definitions look similar to the Backus-Naur grammar for the thing you’re parsing. For instance, we could define a macro for a C++ identifier as follows:

$(Identifier)=[a-zA-Z_]\w*

This matches any letter or underscore for the start of the identifier and then any letter, number or underscore for the remaining characters. I can then use this macro in another definition as follows for a Nested Name Specifier:

$(!NestedNameSpecifier)=($(Identifier)\s*::)

These can then be used in further expressions as follows:

$(!TemplateName)=$(Identifier)
$(!TemplateNameList)=($(NestedNameSpecifier))?$(Identifier)
$(!SimpleTemplateID)=$(TemplateName)\<($(TemplateNameList))?\>
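To make the idea concrete, here is a minimal sketch in Python of how such macros could be expanded into one flat regular expression. It is not the code from my pre-processor: the names parse_definitions and expand are made up for this illustration, and the sketch simply assumes that the optional ! in a definition name can be ignored while expanding.

```python
import re

# The macro definitions from the text, one per line.
DEFINITIONS = r"""
$(Identifier)=[a-zA-Z_]\w*
$(!NestedNameSpecifier)=($(Identifier)\s*::)
$(!TemplateName)=$(Identifier)
$(!TemplateNameList)=($(NestedNameSpecifier))?$(Identifier)
$(!SimpleTemplateID)=$(TemplateName)\<($(TemplateNameList))?\>
"""

# A definition line: $(Name)=body or $(!Name)=body.
# Assumption: the "!" is skipped here; its real meaning is not modelled.
MACRO_DEF = re.compile(r"^\$\(!?(\w+)\)=(.*)$")
# A reference to another macro inside a body: $(Name).
MACRO_REF = re.compile(r"\$\((\w+)\)")


def parse_definitions(text):
    """Return a dict mapping macro names to their unexpanded bodies."""
    macros = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = MACRO_DEF.match(line)
        if match is None:
            raise ValueError(f"not a macro definition: {line!r}")
        macros[match.group(1)] = match.group(2)
    return macros


def expand(name, macros, _active=()):
    """Recursively replace every $(Name) reference with its expansion."""
    if name in _active:
        raise ValueError(f"macro {name!r} refers to itself")
    if name not in macros:
        raise KeyError(f"undefined macro {name!r}")
    return MACRO_REF.sub(
        lambda m: expand(m.group(1), macros, _active + (name,)),
        macros[name],
    )


if __name__ == "__main__":
    macros = parse_definitions(DEFINITIONS)
    print(expand("SimpleTemplateID", macros))
```

Note that every reference is replaced by a full copy of the referenced expression, so a macro that is used in several places is duplicated in the final pattern each time.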

So with this and a successful matching of class definitions off I went… and eventually came to a halt! Why? The expressions that were being produced for function definitions became way too complex and the regular expression parser (TRegEx in this case) started raising errors because the expressions had exceeded circa 32,000 characters. The reason for this is that the C++ grammar (I was using the C++14 standard as a reference; couldn’t find 11) is very complex and allows the definition of things like enumerations and classes in-line within a function. I tried to get around this by limiting the expressions to not allow this but I still ran out of characters 🙁

So is this the end of the great experiment? Yes and No! I still think the technique will be useful in the future for me or someone else, and for this reason I’ve published the testing application DGH Regular Expressions so that people can use the code I’ve written if it proves useful for them (a simpler grammar like Object Pascal could still possibly be parsed using this technique). The application has a class TDGHRegExPreProcessingEng = Class(TInterfacedObject, IDGHRegExPreProcessingEng) in the module DGHRegEx.RegExPreProcessingEng.pas which is a pre-processor for your macro definitions.
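A rough way to see how quickly textual expansion gets out of hand is sketched below in Python. The rule used is entirely synthetic (each level just refers to the level below it a fixed number of times) and is not taken from the C++ grammar; expanded_length and first_depth_over are hypothetical helper names, and the 32,000 figure is only the approximate limit mentioned above.

```python
# Starting point: the identifier expression from the earlier macro.
IDENTIFIER = r"[a-zA-Z_]\w*"


def expanded_length(depth, refs_per_rule=2):
    """Length of a synthetic pattern after `depth` levels of expansion.

    Each level wraps `refs_per_rule` copies of the previous level in an
    alternation, mimicking a macro that references another macro
    several times and is expanded textually.
    """
    if refs_per_rule < 1:
        raise ValueError("refs_per_rule must be at least 1")
    pattern = IDENTIFIER
    for _ in range(depth):
        pattern = "(" + "|".join([pattern] * refs_per_rule) + ")"
    return len(pattern)


def first_depth_over(limit, refs_per_rule=2):
    """Smallest nesting depth whose expanded pattern exceeds `limit`."""
    depth = 0
    while expanded_length(depth, refs_per_rule) <= limit:
        depth += 1
    return depth


if __name__ == "__main__":
    for depth in range(0, 14, 2):
        print(depth, expanded_length(depth))
    print("first depth over 32000:", first_depth_over(32000))
```

With only two references per rule the length roughly doubles at every level, so it takes surprisingly few levels of nesting to pass a limit of this size, and a real grammar rule often references more than two others.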

Does this mean that the OTA project to provide code completion and browsing using CTRL+SHIFT+UP/DOWN for C++ Builder is dead? No. All the above means is that I need to write a recursive descent parser. Is this going to be in C++… err… no! Why? I need a C++ parser for my Browse and Doc It plug-in so I will create one in Object Pascal that just does declarations and reuse that code. I’ve already started compiling a Backus-Naur grammar file for C++14 (I don’t find the way the grammar is described in the standard very readable) and I’ve already started to refactor the code that makes up Browse and Doc It.
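For anyone who hasn’t met the term, a recursive descent parser has one function per grammar rule, and a rule that refers to itself simply calls itself. The sketch below is in Python purely for illustration (the real parser will be Object Pascal) and covers only a heavily simplified, assumed grammar for qualified names with template arguments; the names tokenize, Parser and parse are hypothetical and not part of Browse and Doc It.

```python
import re

# Assumed, heavily simplified grammar (not the C++14 grammar):
#   template_id    := qualified_name [ "<" [ template_id { "," template_id } ] ">" ]
#   qualified_name := identifier { "::" identifier }
TOKEN = re.compile(r"\s*(::|[<>,]|[A-Za-z_]\w*)")


class ParseError(ValueError):
    pass


def tokenize(text):
    """Split the input into identifiers and the punctuation :: < > ,"""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ParseError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise ParseError(f"expected {expected or 'a token'}, found {token!r}")
        self.pos += 1
        return token

    def identifier(self):
        token = self.take()
        if not (token[0].isalpha() or token[0] == "_"):
            raise ParseError(f"expected an identifier, found {token!r}")
        return token

    def qualified_name(self):
        parts = [self.identifier()]
        while self.peek() == "::":
            self.take("::")
            parts.append(self.identifier())
        return "::".join(parts)

    def template_id(self):
        name = self.qualified_name()
        args = []
        if self.peek() == "<":
            self.take("<")
            if self.peek() != ">":
                args.append(self.template_id())  # the rule calls itself
                while self.peek() == ",":
                    self.take(",")
                    args.append(self.template_id())
            self.take(">")
        return (name, args)


def parse(text):
    """Parse one template_id and return a (name, [arguments]) tree."""
    parser = Parser(tokenize(text))
    tree = parser.template_id()
    if parser.peek() is not None:
        raise ParseError(f"unexpected trailing token {parser.peek()!r}")
    return tree


if __name__ == "__main__":
    print(parse("std::map<std::string, std::vector<int>>"))
```

Because template_id calls itself for each argument, arbitrarily deep nesting costs nothing extra in the parser, whereas the macro approach has to spell out every level in the expanded expression.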

Open Tools API Code

I’ve also made a decision to curtail my backwards compatibility for OTA code to RAD Studio 2010 and onwards. Why? Well, going any further back precludes the use of things like namespaces, generics, anonymous methods and regular expressions, to name a few.

Last night I was making a new OTA tool (which I’ll write about separately) backwardly compatible to 2010, and even going back that far causes issues with the resolution of namespaces in 2010 and XE (it makes the uses clauses messy).