Pyparsing paragraphs
I’m running into a small pyparsing problem that I can’t seem to solve. I want to write a rule that parses multi-line paragraphs. The ultimate goal is a recursive grammar that parses something like this:
Heading: awesome
    This is a paragraph and then
    a line break is inserted
    then we have more text

    but this is also a different line
    with more lines attached

    Other: cool
        This is another indented block
        possibly with more paragraphs

        This is another way to keep this up
        and write more things

    But then we can keep writing at the old level
    and get this
into something like the following HTML (once I have the parse tree, of course, I can convert it to whatever format I like):
<Heading class="awesome">
    <p> This is a paragraph and then a line break is inserted and then we have more text </p>
    <p> but this is also a different line with more lines attached</p>
    <Other class="cool">
        <p> This is another indented block possibly with more paragraphs</p>
        <p> This is another way to keep this up and write more things</p>
    </Other>
    <p> But then we can keep writing at the old level and get this</p>
</Heading>
Progress
I’ve gotten to the stage where heading lines and blocks can be parsed with pyparsing. But I can’t:
- Define a paragraph as multiple lines that should be joined together
- Allow paragraphs to be indented
An example
From here, I can output a paragraph as a single line, but there seems to be no way to turn it into a parse tree without removing the newline characters.
I think a paragraph should be:
words = ## I've defined words to allow a set of characters I need
lines = OneOrMore(words)
paragraph = OneOrMore(lines) + lineEnd
But that doesn’t seem to work for me. Any idea would be great 🙂
Solution
I managed to fix this, so here it is for anyone who stumbles upon the same issue in the future. You can define a paragraph as shown below, although it’s certainly not ideal and doesn’t exactly fit the syntax I described. The relevant code is:
line = OneOrMore(CharsNotIn('\n')) + Suppress(lineEnd)
emptyline = ~line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)
join_lines is defined as:
def join_lines(tokens):
    stripped = [t.strip() for t in tokens]
    joined = " ".join(stripped)
    return joined
If it meets your needs, that should point you in the right direction:) Hope this helps!
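Putting the pieces above together, here is a minimal runnable sketch. It assumes pyparsing is installed and uses the same camelCase names as the snippets above. The sample string and the variable names sample, result and first are made up for illustration. It only parses the first paragraph of the input.

```python
from pyparsing import CharsNotIn, OneOrMore, Suppress, lineEnd


def join_lines(tokens):
    # Strip each matched line and join them with single spaces.
    stripped = [t.strip() for t in tokens]
    return " ".join(stripped)


# A line is everything up to (but not including) the newline.
line = OneOrMore(CharsNotIn("\n")) + Suppress(lineEnd)
# "Empty line" here just means: a regular line cannot start at this point.
emptyline = ~line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)

# Illustrative input: two paragraphs separated by one blank line.
sample = (
    "This is a paragraph and then\n"
    "a line break is inserted\n"
    "\n"
    "but this is a different paragraph\n"
)

# parseString stops after the first paragraph; the rest is left unparsed.
result = paragraph.parseString(sample)
first = result[0]
print(first)
```

Because emptyline is only a negative lookahead, it does not consume the blank line itself, so parsing several paragraphs in a row would still need something that skips the separator.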
Better empty lines
The definition of an empty line given above is far from ideal and can be improved considerably. The best approach I’ve found is:
empty_line = Suppress(LineStart() + ZeroOrMore(" ") + LineEnd())
empty_line.setWhitespaceChars("")
This allows empty lines to contain spaces without breaking the match.
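To sanity-check what the grammar should produce, it can help to compare against a plain-Python reference. The sketch below does not use pyparsing at all. split_paragraphs is a hypothetical helper that treats any line containing only whitespace as a paragraph separator, which is the behaviour the empty-line rule is aiming for. It ignores indentation and headings entirely.

```python
def split_paragraphs(text):
    """Group consecutive non-blank lines into paragraphs joined by single spaces."""
    paragraphs = []
    current = []
    for raw in text.splitlines():
        if raw.strip():
            # Non-blank line: it belongs to the paragraph being built.
            current.append(raw.strip())
        elif current:
            # Blank or spaces-only line: close the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Flush the final paragraph if the text does not end with a blank line.
        paragraphs.append(" ".join(current))
    return paragraphs


# Illustrative input: the separator line contains only spaces.
sample = "first line\nsecond line\n   \nnext paragraph\n"
paragraphs = split_paragraphs(sample)
print(paragraphs)
```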