AST

The whole idea of the nsre.ast module is to help you build the RegExp’s graph using regular Python syntax.

If you are wondering what all this talk about graphs is about, don’t worry too much: just keep in mind that you need the AST to build a regular expression. You can learn more in the inspirational article.

Usage

The idea was to create something convenient and familiar to use. As you might have read, there are matchers on one side, which validate tokens, and AST nodes on the other side, which help you build the regular expression itself.

Let’s consider this:

from nsre import *

# Three Final nodes, one per expected token, concatenated with +
hi = Final(Eq("h")) + Final(Eq("i")) + Final(Eq("!"))
re = RegExp.from_ast(hi)
assert re.match("hi!")

What you see here is:

  • Eq(...) is a matcher that matches a token equal to the reference passed to its constructor

  • Final(...) is a final node, aka a node that will be used for matching

  • X + Y is a concatenation: adding two nodes together means that you expect them to follow each other
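The three pieces listed above can be modelled in a few lines of plain Python. The sketch below is a toy re-implementation written for illustration only, not the actual nsre code: the class names mirror the ones used in this page, but the ends() method and the full_match() helper are hypothetical and exist only to make the example runnable without nsre.

```python
from dataclasses import dataclass


class Node:
    """Toy base class: `+` builds a Concatenation node."""

    def __add__(self, other):
        return Concatenation(self, other)


@dataclass
class Eq:
    """Toy matcher: accepts a token equal to the reference."""

    ref: str

    def match(self, token):
        return token == self.ref


@dataclass
class Final(Node):
    """Leaf node: consumes one token if its matcher accepts it."""

    statement: Eq

    def ends(self, tokens, pos):
        # Return the set of positions reachable after matching at `pos`
        if pos < len(tokens) and self.statement.match(tokens[pos]):
            return {pos + 1}
        return set()


@dataclass
class Concatenation(Node):
    """Inner node: the right node must start where the left one ended."""

    left: Node
    right: Node

    def ends(self, tokens, pos):
        return {
            end
            for mid in self.left.ends(tokens, pos)
            for end in self.right.ends(tokens, mid)
        }


def full_match(node, tokens):
    """True if the node can consume the whole token sequence."""
    return len(tokens) in node.ends(tokens, 0)


hi = Final(Eq("h")) + Final(Eq("i")) + Final(Eq("!"))
```

Since + is left-associative, hi is a Concatenation whose left side is itself a Concatenation of the "h" and "i" nodes. The real engine compiles the AST into a graph instead of walking it recursively as this toy does.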

Example

You can look into the nsre.lib module to see many examples of regular expressions being built. Let’s have a look at the email parsing expression.

One of the biggest advantages of this approach is that you can reuse the same AST several times to build your regular expressions. For instance, if you already have an expression able to match a domain name, you can use it in an email address expression.

email_part = ascii_alnums
email_sep = Final(In(["+", "."]))
email_user = email_part + AnyNumber(email_sep + email_part)
email = email_user + seq("@") + domain_name

re = RegExp.from_ast(email)
assert re.match('remy@with-madrid.com')

Please note that nsre.shortcuts.seq() is a shortcut that automatically creates a concatenation of Final nodes with an Eq matcher.
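To give an idea of what such a shortcut amounts to, here is a minimal sketch built on toy stand-in classes rather than on nsre itself. The toy_seq() function is a hypothetical name, and the way it folds the tokens together is an assumption about the general idea, not a description of how nsre.shortcuts.seq() is actually written.

```python
from dataclasses import dataclass
from functools import reduce
from operator import add


class Node:
    """Toy base class: `+` builds a Concatenation node."""

    def __add__(self, other):
        return Concatenation(self, other)


@dataclass(frozen=True)
class Eq:
    """Toy matcher holding the reference token."""

    ref: str


@dataclass(frozen=True)
class Final(Node):
    statement: Eq


@dataclass(frozen=True)
class Concatenation(Node):
    left: Node
    right: Node


def toy_seq(tokens):
    """Wrap each token in Final(Eq(...)) and join the nodes with +."""
    return reduce(add, (Final(Eq(token)) for token in tokens))
```

With a single token the result is just one Final node, so toy_seq("@") is equivalent to Final(Eq("@")). This toy version raises a TypeError on an empty sequence.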

Operations

Let’s review all the operations that you can do with nodes. In these examples, let’s suppose that node_a would match the letter "a", node_b the letter "b", and so forth.

Concatenation

Expect two nodes to be consecutive using the + operator.

exp = node_a + node_b + node_c
# Would match "abc"

Alternation

Expect either one node or the other using the | operator.

exp = node_a + (node_b | node_c)
# Would match either "ab" or "ac"

Multiplication

Multiply a node in order to indicate repetition. You can multiply by:

  • An int, to get exactly this number of occurrences

  • slice(X, None) to get from X to +inf occurrences

  • slice(None, X) to get from 0 to X occurrences

  • slice(X, Y) to get from X to Y occurrences

exp = node_a * slice(1, 3)
# Would match "a", "aa" or "aaa"
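For readers used to Python’s standard re module, the multipliers listed above map onto its {m,n} quantifiers. The sketch below makes that mapping explicit. The quantifier() helper is hypothetical and not part of nsre. It assumes, based on the example above, that the upper bound of the slice is inclusive, unlike regular Python slicing.

```python
import re


def quantifier(times):
    """Translate an int or a slice into a quantifier for the stdlib re module."""
    if isinstance(times, int):
        # Exactly this number of occurrences
        return "{%d}" % times
    if isinstance(times, slice):
        # A missing lower bound means 0, a missing upper bound means +inf
        low = times.start or 0
        high = "" if times.stop is None else times.stop
        return "{%s,%s}" % (low, high)
    raise TypeError("expected an int or a slice")


# Stdlib counterpart of node_a * slice(1, 3)
one_to_three = re.compile("a" + quantifier(slice(1, 3)))
```

Here one_to_three.fullmatch() accepts "a", "aa" and "aaa" but rejects the empty string and "aaaa".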

Capture

To report the content that was matched into a capture group, simply name the capture group using brackets.

exp = node_a + (node_b | node_c)['foo']
# For "ab" group "foo" would contain "b"
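Note that in Python the subscription binds more tightly than +, so in the expression above the capture only wraps the parenthesised alternation. The toy classes below (stand-ins written for this example, not the nsre implementation) show which tree such an expression produces when +, | and [] are overloaded.

```python
from dataclasses import dataclass


class Node:
    """Toy base class overloading +, | and []."""

    def __add__(self, other):
        return Concatenation(self, other)

    def __or__(self, other):
        return Alternation(self, other)

    def __getitem__(self, name):
        return Capture(name, self)


@dataclass
class Final(Node):
    statement: str  # stand-in for a matcher


@dataclass
class Concatenation(Node):
    left: Node
    right: Node


@dataclass
class Alternation(Node):
    left: Node
    right: Node


@dataclass
class Capture(Node):
    name: str
    statement: Node


node_a, node_b, node_c = Final("a"), Final("b"), Final("c")
exp = node_a + (node_b | node_c)["foo"]

# The capture wraps the alternation only, not the leading node_a
assert exp == Concatenation(node_a, Capture("foo", Alternation(node_b, node_c)))
```

To capture the whole expression instead, the brackets would have to be applied to a parenthesised (node_a + (node_b | node_c)).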

Reference

On top of using the Python syntax as shortcuts, you can directly create instances of nodes. It’s sometimes more convenient to do so.

class nsre.ast.Node

Root class for a node. It has no real use by itself, but it is useful for defining the operators.

Assembling the nodes together will build an AST which the RegExp class will then turn into a compiled regular expression.

Example:

>>> from nsre import *
>>> root = Final(Eq('a')) + Final(Eq('b')) * slice(1, 5)
>>> re = RegExp.from_ast(root)
>>> assert re.match('abb')
>>> assert not re.match('a')

copy()

Generates a copy of the node. This is needed because of the way graph generation works: every node is put into a graph, so each one needs a unique ID in case the same sub-tree is used several times.
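The following sketch illustrates the problem that copying solves. ToyNode, its id attribute and collect_ids() are hypothetical names made up for this example. They only model the idea of one ID per node and do not reflect how nsre assigns or stores its IDs.

```python
import itertools

_ids = itertools.count()


class ToyNode:
    """Toy node that receives a fresh ID when it is created."""

    def __init__(self, label, children=()):
        self.id = next(_ids)
        self.label = label
        self.children = list(children)

    def copy(self):
        # Re-create the node and its whole sub-tree, hence with new IDs
        return ToyNode(self.label, [child.copy() for child in self.children])


def collect_ids(node):
    """Yield the ID of every position in the tree, depth first."""
    yield node.id
    for child in node.children:
        yield from collect_ids(child)


part = ToyNode("part")

# The same object used twice: two positions in the tree share one ID
shared = ToyNode("concat", [part, part])
ids_shared = list(collect_ids(shared))

# Copies used instead: every position gets its own ID
fresh = ToyNode("concat", [part.copy(), part.copy()])
ids_fresh = list(collect_ids(fresh))
```

In the shared tree, three positions only carry two distinct IDs, which would make two graph vertices indistinguishable. With copies, the three positions have three distinct IDs.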

class nsre.ast.Final(statement: nsre.matchers.Matcher[Tok, Out])

In the end, all nodes in the graph should be Final(). They allow the engine to call the matcher (stored in statement here).

class nsre.ast.Concatenation(left: nsre.ast.Node, right: nsre.ast.Node)

Represents a concatenation of the left and the right nodes.

class nsre.ast.Alternation(left: nsre.ast.Node, right: nsre.ast.Node)

Represents an alternation of the left and the right nodes.

class nsre.ast.Maybe(statement: nsre.ast.Node)

Represents 0 or 1 occurrence of the statement.

class nsre.ast.AnyNumber(statement: nsre.ast.Node)

Represents 0 to +inf occurrences of the statement.

class nsre.ast.Capture(name: str, statement: nsre.ast.Node)

Represents a capture group around the statement.

copy()

Generates a copy of the node. This is needed because of the way graph generation works: every node is put into a graph, so each one needs a unique ID in case the same sub-tree is used several times.