Python – Pyparsing paragraphs


I’m running into a small pyparsing problem that I can’t seem to solve. I want to write a rule that parses multi-line paragraphs. The ultimate goal is a recursive grammar that will parse something like this:

Heading: awesome
    This is a paragraph and then
    a line break is inserted
    then we have more text

but this is also a different line
    with more lines attached

Other: cool
        This is another indented block
        possibly with more paragraphs

This is another way to keep this up
        and write more things

But then we can keep writing at the old level
    and get this

into something like HTML (with the parse tree, of course, I can convert it to whatever format I like), so maybe:

<Heading class="awesome">

<p> This is a paragraph and then a line break is inserted and then we have more text </p>

<p> but this is also a different line with more lines attached</p>

<Other class="cool">
        <p> This is another indented block possibly with more paragraphs</p>
        <p> This is another way to keep this up and write more things</p>
    </Other>

<p> But then we can keep writing at the old level and get this</p>
</Heading>

Progress

I’ve reached the stage where heading lines and indented blocks can be parsed with pyparsing. But I can’t:

  • Define a paragraph as multiple lines that should be joined together
  • Allow a paragraph to be indented

An example

From here, I can output the paragraph as a single line, but there seems to be no way to turn it into a parse tree without removing the newline characters.

I think a paragraph should be:

words = ## I've defined words to allow a set of characters I need
lines = OneOrMore(words)
paragraph = OneOrMore(lines) + lineEnd

But that doesn’t seem to work for me. Any idea would be great 🙂
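
One likely reason the attempt above misbehaves: by default pyparsing skips whitespace, newlines included, before most tokens, so OneOrMore keeps consuming words across line breaks and even across blank lines. A minimal sketch of that effect, assuming Word(alphas) as a stand-in for the unspecified words definition:

```python
from pyparsing import OneOrMore, Word, alphas

# Assumption: a simple stand-in for the question's unspecified `words` rule.
words = Word(alphas)
lines = OneOrMore(words)

# The newline and the blank line are skipped as ordinary whitespace,
# so the line and paragraph boundaries are invisible to the grammar.
tokens = lines.parseString("first line\nsecond line\n\nnext paragraph").asList()
print(tokens)
```

Every word ends up in one flat token list, which is why the line structure has to be matched explicitly (for example with lineEnd) rather than left to whitespace skipping.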

Solution

I managed to fix this, so here it is for anyone who stumbles upon the same issue in the future. You can define a paragraph as shown below, although it’s certainly not ideal and doesn’t exactly fit the syntax I described. The relevant code is:

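from pyparsing import CharsNotIn, OneOrMore, Suppress, lineEnd

# A line is a run of non-newline characters followed by a (suppressed) line end
line = OneOrMore(CharsNotIn('\n')) + Suppress(lineEnd)
# Negative lookahead: succeeds, without consuming input, where no line starts
emptyline = ~line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)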

join_lines is defined as:

def join_lines(tokens):
    stripped = [t.strip() for t in tokens]
    joined = " ".join(stripped)
    return joined
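
To see the rule in action, here is a small self-contained sketch that repeats the definitions above and parses the first paragraph of the question's sample text. The sample string and the variable names sample and first are illustrative assumptions, not part of the original answer:

```python
from pyparsing import CharsNotIn, OneOrMore, Suppress, lineEnd

def join_lines(tokens):
    # Strip each line's indentation and join the lines with single spaces.
    return " ".join(t.strip() for t in tokens)

line = OneOrMore(CharsNotIn("\n")) + Suppress(lineEnd)
emptyline = ~line  # lookahead only: nothing is consumed here
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)

# Illustrative input: an indented paragraph, a blank line, then more text.
sample = (
    "    This is a paragraph and then\n"
    "    a line break is inserted\n"
    "    then we have more text\n"
    "\n"
    "but this is also a different line\n"
)

# Without parseAll, parsing stops at the blank line that ends the paragraph.
first = paragraph.parseString(sample)[0]
print(first)
```

The three indented lines come back as one string, while the text after the blank line is left unparsed for a later match.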

If it meets your needs, that should point you in the right direction :) Hope this helps!

Better empty lines

The definition of an empty line given above is certainly not ideal and can be improved considerably. The best way I’ve found is the following:

empty_line = Suppress(LineStart() + ZeroOrMore(" ") + LineEnd())
empty_line.setWhitespaceChars("")

This allows empty lines to contain spaces without breaking the match.
