NAME

Jenda.Rex - Regular expressions via OLE

Version 0.14.05

SYNOPSIS

        Dim re As Object
        Dim arr As Variant
        Dim info As Collection
        Set re = CreateObject("Jenda.Rex")
        Set info = New Collection
        if re.Test( strMail, "\w+@\w+\.\w+") then
                MsgBox "Looks like a mail address", vbInformation
        end if
        arr = re.Match( strINILine, "^([\w\d]+)=(.*)$")
        if isEmpty(arr) then
                MsgBox "Malformed line! : " & strINILine, vbCritical
        else
                info.Add arr(1), arr(0)
        end if
        re.TieCollection info, "info"
        strTemplate = "%id%: %programtitle% by %author% (version %version%)"
        strToShow = re.Replace( strTemplate, "%(.*?)%", "$info{$1}", "g")

DESCRIPTION

The Microsoft VBScript Regular Expressions (5.5) object is funny at best: it is hard to use, leads to lengthy code and really looks like something a VB programmer made.

I was so frustrated by the object that I sat down and wrote my own.

It is written in Perl so it provides full Perl regular expressions. It even allows you to access VB arrays, collections, recordsets and other similar objects from the replacement string.

Properties

Version
The version number of the object.

Encoding
The encoding the texts are expected to be in. The default is whatever your system is set to, that is the ACP value in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\NLS\CodePage.

AllowXHTML
Controls whether the HTML related functions accept XHTML, that is tags of the form <NAME param="xxx"/>.

Log
Turns on and off the logging. For test purposes only.

LogFile
Specifies the path to the log file. For test purposes only.

Methods

Test( stringToTest, regularExpression [, options])
This function tries to match the string to the regular expression (using the options) and returns success or failure.

The parts of the string captured by groups in the regexp may be accessed using the TestMatched() or Matched() method.

        If RE.Test( stringToTest, regularExpression, options) Then
                result = RE.TestMatched(0) & RE.TestMatched(1)
        End If
                         is equivalent to
        if ($stringToTest =~ /regularExpression/options) {
                $result = $1 . $2;
        }

Please note that the matches are indexed from 0, not 1 in TestMatched() !!!

You may use the common regular expressions defined in Regexp::Common http://search.cpan.org/~abigail/Regexp-Common-2.120/lib/Regexp/Common.pm:

        If RE.Test( stringToTest, "^$RE{num}{real}") Then

TestMatched( N ) / Matched (N)
Returns the string captured by the (N+1)th capturing group in the regexp evaluated by the last RE.Test(), so TestMatched(0) corresponds to Perl's $1. The other methods DO NOT affect the values returned by TestMatched(), and each Jenda.Rex.Prepared object has its own TestMatched() buffer.
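
The zero-based indexing can be modelled in a few lines of Python. This is only a sketch of the numbering described above, built on Python's re module rather than on Jenda.Rex; the helper name matched_group is made up for the illustration.

```python
import re


def matched_group(match, n):
    """Model of TestMatched(N): N = 0 is Perl's $1, N = 1 is $2, and so on."""
    # Python numbers groups the way Perl does ($1 == group(1)), hence the + 1.
    return match.group(n + 1)


m = re.search(r"^(\w+)=(.*)$", "colour=blue")
first = matched_group(m, 0)   # what Perl calls $1
second = matched_group(m, 1)  # what Perl calls $2
print(first, second)
```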

Match( stringToMatch, regularExpression [, options])
This function matches the string and returns an array containing strings. If there are no matches it returns an Empty variant.
        arr = Match( stringToMatch, regularExpression, options)
                         is equivalent to
        @arr = ($stringToMatch =~ /regularExpression/options)

1) If the options DO NOT include "g" and there are no () groups you'll get a one element array containing (1) if the regexp matches, Empty otherwise.

2) If the options DO NOT include "g" and there are () groups you'll get an array containing the strings matched by the subregexps in the () groups if the regexp matches, Empty otherwise.

3) If the options DO include "g" and there are no () groups you'll get an array containing all the strings matched by the whole regexp if it matched at least once, Empty otherwise.

4) If the options DO include "g" and there are some () groups you'll get an array containing all the strings matched by all the () groups from all matches of the whole regexp in case of success, Empty otherwise.

This means that if there are three () groups then arr(0) contains the string matched by the first () group the first time, arr(1) the second group, arr(2) the third group, arr(3) the second match for the first () group and so on. If a () group matched an empty string or was "skipped" due to "|" then the arr(x) will contain an empty string.

Note that if you do not specify any () groups and the options do NOT include "g" you only get Array(1) if the regexp matches! If you want to get the first match you have to enclose the regexp in parens!
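
The four cases can be sketched as a small Python function. It is a model of the rules listed above, written with Python's re module, not the Jenda.Rex implementation; the name match_like_perl and the global_ flag (standing in for the "g" option) are assumptions of this sketch, and None stands in for the Empty variant.

```python
import re


def match_like_perl(text, pattern, global_=False):
    """Return a list shaped like the four cases above, or None for Empty."""
    regex = re.compile(pattern)
    if not global_:
        m = regex.search(text)
        if m is None:
            return None
        # Cases 1 and 2: [1] without groups, else the captured strings.
        return list(m.groups(default="")) if regex.groups else [1]
    result = []
    for m in regex.finditer(text):
        if regex.groups:
            # Case 4: the groups of every match, flattened into one list.
            result.extend(m.groups(default=""))
        else:
            # Case 3: the text matched by the whole regexp.
            result.append(m.group(0))
    return result or None


print(match_like_perl("a=1;b=2", r"(\w)=(\d)", global_=True))
```

Skipped groups come back as empty strings here because of default="", which mirrors the rule for groups skipped due to "|".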

Replace( stringToProcess, regularExpression, replacementString [, options])
Replaces the matched substring(s) by the replacementString.
        strResult = Replace( stringToProcess, regularExpression, replacementString, options)
                                        is equivalent to
        ($strResult = $stringToProcess) =~ s/regularExpression/replacementString/options;
         # make a copy and do a replace on it

The replacement string may contain $1, $2, ... variables denoting the strings matched by () groups in the regular expression. It may even contain subscripts into tied arrays "$array[$1]" or collections "$col{$1}".

Prepare( regularExpression, options)
Compiles the regular expression and creates a new Jenda.Rex.Prepared object. This object contains the regular expression in compiled state, which means not only that you do not have to pass the regular expression again, but also that using Jenda.Rex.Prepared methods is quicker than using the base Jenda.Rex with the same regular expression, at least for more complex regexps.

You can create as many Jenda.Rex.Prepared objects as you like.

If you use the same regular expression many times it's recommended to "Prepare" it.
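
The same compile-once idea exists in most regexp libraries. As a rough analogy only (this is Python's re.compile, not Jenda.Rex), the pattern is compiled a single time and the compiled object is then reused for every string:

```python
import re

# Compile once, much like RE.Prepare(regularExpression, options) ...
ini_line = re.compile(r"^(\w+)=(.*)$")

# ... then reuse the compiled object for every string to be matched.
lines = ["name=Rex", "broken line", "version=0.14"]
parsed = [m.groups() for m in map(ini_line.match, lines) if m]
print(parsed)
```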

TieArray( arrayToTie, tiedName)
Copies the array to the object so that you may use it in replacement strings.
        Example:
                arr = Array( "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat" )
                re.TieArray arr, "day"
                strNamed = re.Replace ( strByNum, "Weekday:(\d+)", "Weekday:$day[$1]")

Keep in mind that TieArray copies the array, so any changes you make between TieArray and Replace are not visible!

UntieArray( tiedName)
Destroys the tied array.

TieCollection( collectionLikeObject, tiedName, [ subscriptionFunctionName] , [propertyName])
Every occurrence of $tiedName{key} will be treated as collectionLikeObject.subscriptionFunctionName(key).propertyName

Default subscriptionFunctionName is "Item". If you do not set propertyName we use the default property.

        Example:
                Dim col As Collection
                Set col = New Collection
                col.Add "value", "key"
                ...
                re.TieCollection col, "col"
                strResult = re.Replace( strWithVars, "%(.*?)%", "$col{$1}", "g")
                Dim rst As ADODB.Recordset
                ...
                re.TieCollection rst, "data", "Fields", "Value"
                while not rst.EOF
                        strResult = re.Replace( strWithVars, "%(.*?)%", "$data{$1}", "g")
                        MsgBox strResult, vbInformation
                        rst.MoveNext
                wend

As you can see, TieCollection, unlike TieArray, only references the object, so any changes you make to the object are visible in the Jenda.Rex object.
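
The difference between the copying TieArray and the referencing TieCollection can be illustrated with plain Python objects. This is a model of the semantics described above, not of the COM object: a list copy plays the tied array, a shared dict plays the tied collection, and the names tied_days and tied_info are made up for the sketch.

```python
import re

days = ["Sun", "Mon", "Tue"]
info = {"author": "Jenda"}

tied_days = list(days)   # TieArray: a snapshot taken at tie time
tied_info = info         # TieCollection: the very same object

days[1] = "Monday"           # not visible through the tied copy
info["author"] = "Jenda K."  # visible through the tied reference

# Equivalent of the replacement strings "Weekday:$day[$1]" and "$info{$1}".
by_day = re.sub(r"Weekday:(\d+)",
                lambda m: "Weekday:" + tied_days[int(m.group(1))],
                "Weekday:1")
by_key = re.sub(r"%(.*?)%", lambda m: tied_info[m.group(1)], "by %author%")
print(by_day, "|", by_key)
```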

UntieCollection( tiedName)
Destroys the reference to the collection.

Keep in mind that if you Tie a collection to a Jenda.Rex object it will not be destroyed until you either Untie it or you destroy the Jenda.Rex object!

Also keep in mind that the tiedNames are CASE SENSITIVE !!!

Quote( String )
Quotes special characters in a string so that it can be safely included in a regexp and be searched for literally.
        Example:
                Set NeedQuoting = RE.Prepare(RE.Quote(QuoteCharacter) & "|" & RE.Quote(FieldSeparator) & "|\x0D|\x0A")
                ' if it contains the separator or the quote or any end of line character

If you did not quote the variables and the FieldSeparator happened to be "|" you'd end up with the regexp "'|||\x0D|\x0A", and because it contains an empty alternative it would of course match anything.
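
The effect is easy to reproduce with any regexp engine. The sketch below uses Python's re.escape as a stand-in for RE.Quote (an analogy, not the same function) and assumes a single quote as the quote character and "|" as the field separator.

```python
import re

quote_char = "'"
separator = "|"

# Without quoting, the separator turns into extra, empty alternatives.
unquoted = re.compile(quote_char + "|" + separator + r"|\x0D|\x0A")
# With quoting, the separator is searched for literally.
quoted = re.compile(
    re.escape(quote_char) + "|" + re.escape(separator) + r"|\x0D|\x0A")

plain = "no special characters here"
print(bool(unquoted.search(plain)))  # the empty alternative matches
print(bool(quoted.search(plain)))
print(bool(quoted.search("a|b")))
```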

HTMLescape( String )
Escapes all characters special to HTML. The result will be safe to include in HTML code as TEXT.
        html = "<td><b>" & RE.HTMLescape(value) & "</b></td>"

TAGescape( String )
Escapes all characters special to HTML plus quotes and doublequotes. The result will be safe to include in HTML code as a tag parameter value.
        html = "<input type=text name=Foo value=""" & RE.TAGescape(value) & """>"
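
The split between escaping for text and escaping for a tag parameter is the same one Python's html.escape makes with its quote argument. The sketch below uses it as a stand-in; the exact entities Jenda.Rex emits are not documented here, so treat the output format as an assumption.

```python
import html

value = 'Tom & "Jerry" <jerry@example.com>'

as_text = html.escape(value, quote=False)  # like HTMLescape: & < > only
as_attr = html.escape(value, quote=True)   # like TAGescape: quotes as well

cell = "<td><b>" + as_text + "</b></td>"
field = '<input type=text name=Foo value="' + as_attr + '">'
print(cell)
print(field)
```

Text that is only escaped for use as TEXT still contains raw doublequotes, which is why it must not be pasted into a parameter value.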

JSescape( String )
Escapes all characters special to HTML plus quotes and doublequotes, and prepends a backslash to each quote or doublequote. The result will be safe to include in HTML code as a JavaScript string.
        html = "<a href=""JavaScript:doSomething( '" & RE.JSescape(value) & "');"">Click here</a>"

FUZZYescape( String )
Escapes the characters special to HTML that are NOT part of well-formed HTML tags. That is, if the source string is
        <b>Holds:</b> 0 < 1

then the result will be

        <b>Holds:</b> 0 &lt; 1

PolishHTML( String )
Escapes the characters special to HTML that are NOT part of well-formed HTML tags. That is, if the source string is
        <b>Holds:</b> 0 < 1

then the result will be

        <b>Holds:</b> 0 &lt; 1

and adds some paragraph and <BR> tags if the text doesn't already contain them.

To be used to polish HTML coding you got from users :-)
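
The escaping rule shared by FUZZYescape() and PolishHTML() can be approximated as follows. This is a deliberately simple Python sketch, not the algorithm Jenda.Rex uses: the TAG pattern is an assumed definition of a "well formed tag", and the paragraph and <BR> insertion done by PolishHTML() is left out.

```python
import html
import re

# Assumed notion of a well-formed tag: <name ...>, </name> or <name/>.
TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?:\s+[^<>]*?)?\s*/?>")


def fuzzy_escape(text):
    """Escape &, < and > everywhere except inside things that look like tags."""
    out = []
    pos = 0
    for m in TAG.finditer(text):
        out.append(html.escape(text[pos:m.start()], quote=False))
        out.append(m.group(0))  # keep the tag itself untouched
        pos = m.end()
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)


print(fuzzy_escape("<b>Holds:</b> 0 < 1"))
```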

StripHTML
Removes all HTML tags, decodes HTML entities and does some rudimentary formatting.

DeWordify( String )
Replaces the "cute" MS Word quotes and apostrophes with normal ones.
        Str = RE.DeWordify( Str )

DeWordifyHTML ( String )
Removes the mso- styles and things like <o:p> added by MS Word plus fixes some common syntax errors in the HTML.
        Str = RE.DeWordifyHTML( Str )

DeMoronizeHTML( String )
Filters the HTML to remove superfluous <SPAN> and <FONT> tags and other things that take up space, but do not affect the display.

E.g. it replaces ...<SPAN some attributes>whatever</SPAN><SPAN the same attributes>... by ...<SPAN some attributes>whatever..., removes <SPAN> and <FONT> tags with no attributes, removes <SPAN>, <FONT>, <B>, ... tags with only whitespace content etc.

        Str = RE.DeMoronizeHTML( Str )

ImproveHTML( String )
This is a combination of PolishHTML, DeWordifyHTML and DeMoronizeHTML
        Str = RE.ImproveHTML( Str )

DeUTF8( String)
Decodes a UTF8 encoded string and returns it encoded in the encoding you specified before (the system 8bit encoding by default).
        Str = RE.DeUTF8( Str )

EnUTF8( String)
Converts a string from the encoding you specified before (the system 8bit encoding by default) to UTF8.
        Str = RE.EnUTF8( Str )
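
What the two conversions do to the underlying bytes can be shown with Python's codecs. This is a model only: the function names mirror the methods above, and Windows-1252 (cp1252) is assumed as the preset 8bit encoding purely for the example.

```python
def en_utf8(data: bytes, encoding: str = "cp1252") -> bytes:
    """Model of EnUTF8: bytes in the preset 8bit encoding -> UTF8 bytes."""
    return data.decode(encoding).encode("utf-8")


def de_utf8(data: bytes, encoding: str = "cp1252") -> bytes:
    """Model of DeUTF8: UTF8 bytes -> bytes in the preset 8bit encoding."""
    return data.decode("utf-8").encode(encoding)


latin = b"caf\xe9"        # "cafe" with an acute e, as Windows-1252 bytes
utf8 = en_utf8(latin)     # the accented letter becomes two bytes
round_trip = de_utf8(utf8)
print(utf8, round_trip)
```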

EnUTF8File( Filename, [NewFilename])
Converts the file from the preset encoding (the system 8bit encoding by default) to UTF8. If you specify only one filename then the result is written into the original file.
        RE.EnUTF8File Filename
        RE.EnUTF8File Filename, OtherFilename

HTMLfilter( strAllowedTags)
Creates an object for HTML filtering. The strAllowedTags string should contain the allowed tags and their allowed parameters in this format:
        tag1 tag2 tag3 ...
        tag4 tag5 : foo bar baz ...
        # comment
        tag6 # comment
        tag7 ; comment
        tag8 ' comment
        ...

The created object supports these two methods:

        strResult = objFilter.doSTRING( strSource )
        objFilter.doFILE strSourcePath, strResultPath

After filtering the text will only contain the allowed tags and allowed parameters. All other HTML will be stripped.

It's recommended to polish the HTML with FUZZYescape() or PolishHTML() beforehand.

        Example:
                strFilter = "B" & vbCRLF & "I" & vbCRLF & "A: HREF NAME" & vbCRLF & "BR"
                Set objFilter = re.HTMLfilter( strFilter )
                str = re.FUZZYescape( str )
                str = objFilter.doSTRING( str )

CSVParser( SeparatorChar, QuoteChar, EscapeChar, EndOfLine, AlwaysQuote, Binary)
Returns an object that can parse and create CSVs, handling the quoting and escaping. The options are:
        SeparatorChar - the separator, by default ","
        QuoteChar - the quote character, by default a doublequote
        EscapeChar - the character to use to escape the quote character, by default a doublequote
        EndOfLine - the character that denotes the end of line in the file,
                by default CRLF (vbCrLf, "\r\n", "\x0D\x0A")
        AlwaysQuote - boolean, controls whether even the items that do not contain
                the quote or separator characters are to be quoted,
                by default False
        Binary - boolean, specifies whether the included texts may contain
                characters outside the ASCII printable range.
                by default True
        Dim RE, CSV
        Set RE = Server.CreateObject("Jenda.Rex")
        Set CSV = RE.CSVParser()
        Dim Arr
        CSV.ParseFile fileName
        Do While True
                ' parse the next line inside the loop so that the loop can end
                Arr = CSV.Parse( "" )
                If IsEmpty(Arr) Then Exit Do
                name = Arr(0)
                email = Arr(1)
                pwd = Arr(2)
                ...
        Loop
        CSV.CloseFile

See Jenda.Rex.CSVParser methods

Jenda.Rex.Prepared methods

The methods of prepared regular expressions are almost the same as those of the general object, except that there is no Prepare method, none of the additional functions are available, and Test, Match and Replace do not take the regularExpression and options parameters.

Test( stringToMatch)
TestMatched( N)
Match( stringToMatch)
Replace( stringToProcess, replacementString)
These work like the methods of the same names on the main Jenda.Rex object, minus the regularExpression and options parameters.

TieArray, UntieArray, TieCollection, UntieCollection
These methods do not work. I don't know what the problem is. But since they'd share the namespace with the basic Jenda.Rex object anyway I don't think it matters. Simply Tie the arrays and collections using the main Jenda.Rex object.

Jenda.Rex.CSVParser methods

Parse
Parses the next line from the currently opened file and returns an array containing the data. Returns Empty if there are no more lines.

Parse( line)
Parses the string passed and returns an array containing the data. Returns Empty if the line is not formatted correctly.

Parse( line1, line2, line3, ...)
Parse( arrayOfLines )
Parses the passed lines and returns an array of arrays. The outer array will contain Empty in the items whose lines were not formatted correctly!

Please keep in mind that arrays are ZERO based.

ParseFile( fileName)
Opens the specified file for reading.

CloseFile
Closes the file.

Combine
Combines the Pushed items and returns the resulting string including the newline character.

Combine( item1, item2, item3, ...)
Combine( itemArray )
Combine( item1, itemArray1, itemArray2, item2, ...)
Combines the items specified in the parameter list (or included in the passed arrays) and returns the resulting string including the newline character. Doesn't affect the Pushed items.

Push( item1, item2, item3, ...)
Push( itemArray )
Push( item1, itemArray1, itemArray2, item2, ...)
Adds the specified items into a list kept by the CSVParser object.

Flush
Combines the items added by Push and returns the resulting string. Clears the list kept by CSVParser.

Flush( 0)
Combines the items added by Push and returns the resulting string. Doesn't change the list kept by CSVParser.

Flush( Count )
Combines the items added by Push and returns the resulting string. Removes the last Count items from the list kept by CSVParser.

Flush( - Count )
Combines the items added by Push and returns the resulting string. Removes all except first Count items from the list kept by CSVParser.

Skip( Count )
Adds Count empty items into the list kept by CSVParser.

Clear
Removes all items from the list kept by CSVParser.

Count
Returns the number of items in the list kept by CSVParser.

Examples
        line = CSV.Combine( name, email, pwd, salary)
        otherLine = CSV.Combine( name, email, arrayOfSomething)
        CSV.Push name, email
        if includeThis Then CSV.Push this
        if includeThat Then CSV.Push that
        CSV.Push theLastThing
        yetOtherLine = CSV.Flush
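
The bookkeeping behind Push, Skip and the four Flush variants can be modelled with a short Python class. It is a toy model built on Python's csv module, not the CSVParser object: the class name CsvBuffer is made up, and the sketch assumes that Flush, like Combine, returns the line including the CRLF end of line.

```python
import csv
import io


class CsvBuffer:
    """Toy model of the item list kept between Push and Flush."""

    def __init__(self):
        self.items = []

    def push(self, *items):
        # Arrays passed to Push are flattened into the item list.
        for item in items:
            if isinstance(item, (list, tuple)):
                self.items.extend(item)
            else:
                self.items.append(item)

    def skip(self, count):
        # Skip( Count ) adds Count empty items.
        self.items.extend([""] * count)

    def flush(self, count=None):
        out = io.StringIO()
        csv.writer(out, lineterminator="\r\n").writerow(self.items)
        if count is None:        # Flush: clear the whole list
            self.items = []
        elif count > 0:          # Flush( Count ): drop the last Count items
            del self.items[-count:]
        elif count < 0:          # Flush( - Count ): keep the first Count items
            keep = -count
            del self.items[keep:]
        # Flush( 0 ) leaves the list unchanged.
        return out.getvalue()


buf = CsvBuffer()
buf.push("Jenda", "a,b")
buf.push(["x", "y"])
print(repr(buf.flush(0)))
```

Keeping the first few items with a negative Count is handy when every line starts with the same columns and only the tail changes.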

Properties

Version
Contains the version of the object.

AllowXHTML
Controls whether XHTML tags are allowed. Only affects FUZZYescape, PolishHTML, HTMLfilter and StripHTML.
        Example:
                re.AllowXHTML = True
                MsgBox re.FUZZYescape("Hello <BR/>World")
                '       prints "Hello <BR/>World"
                re.AllowXHTML = False
                MsgBox re.FUZZYescape("Hello <BR/>World")
                '       prints "Hello &lt;BR/&gt;World"
                re.AllowXHTML = True
                Set filter = re.HTMLfilter( "B I BR" )
                MsgBox filter.doSTRING("<b>Hello</b> <BR/><foo>World</foo>")
                '       prints "<b>Hello</b> <BR/>World"
                re.AllowXHTML = False
                Set filter = re.HTMLfilter( "B I BR" )
                MsgBox filter.doSTRING("<b>Hello</b> <BR/><foo>World</foo>")
                '       prints "<b>Hello</b> World"

The HTMLfilter object remembers the state of re.AllowXHTML and is NOT affected by later changes!

Default is True.

SEE ALSO

Regular expressions documentation

I will not describe Perl regular expressions here. You may find the related docs here:

http://aspn.activestate.com/ASPN/Reference/Products/ActivePerl/lib/Pod/perlre.html and http://aspn.activestate.com//ASPN/Reference/Products/ActivePerl/lib/Pod/perlop.html

The included JendaRexTest.frm contains examples of pretty much the whole functionality.

AUTHORS

The COM object was written by Jenda@Krynicky.cz ( http://jenda.krynicky.cz )

To name all authors of Perl would be toooooo lengthy.

It was converted to an ActiveX DLL by PerlCtrl from Perl Dev Kit 4.1 by ActiveState.

COMMENTS

The DLL itself depends only on a few system DLLs that should be available everywhere.

There's one possible problem though. Jenda.Rex needs to be able to extract some files to %TEMP%\pdk directory. (The filenames will look like this: a4da113933612fb90ce798dd83702bf2.dll)

If it can't create the pdk directory or the files it will not work! You may need to review the permissions to the %TEMP%\pdk directory.

COPYRIGHT

    (c) 2001-2007 Jenda Krynicky

You may distribute it under the terms of either the GNU General Public License or the Artistic License (both are easy to find on the Net), or include it in your application in compiled form.

In the last case :

        - You do not have to mention Jenda.Rex in the docs or license conditions of your program.
        - You don't have to install it separately using Jenda.Rex's own installer,
           you may install it within your own installation procedure.
           (all you need to do is to copy the JendaRex.dll to your program's directory
           and run regsvr32 /s JendaRex.dll).
        - The only thing I would like you to do is to include the JendaRex.html and install it
           into the same directory as JendaRex.dll.
           (No need to link to the file from your docs or wherever. Just place it in the same dir.)

VERSION

Version 0.14.05 10 Nov 2006
