
Inner HTML elements are processed after the tail text #333

@HiromuHota


Describe the bug

In tests/data/html_simple/md.html, the fragment <em>italics and later <strong>bold</strong></em>. Even is processed in the following order:

  1. italics and later
  2. . Even
  3. bold

This is illustrated in #12 too.

To Reproduce
Steps to reproduce the behavior:

  1. Run tests/parser/test_parser.py::test_parse_md_details and set a breakpoint

Expected behavior

The fragment above should be processed in document order:

  1. italics and later
  2. bold
  3. . Even

Environment

  • Fonduer Version: 0.7.0 and master (3d5392c)

Additional context

Why does this happen?

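When the node is <em>italics and later <strong>bold</strong></em>, node.text (=italics and later ) and node.tail (=. Even) are both processed first, and only then is the child node <strong>bold</strong> processed. The tail text follows the closing </em> tag in the document, so it should be handled after all of the node's children.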
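
The two orderings can be reproduced outside Fonduer with a minimal sketch. It assumes the parser's nodes follow the ElementTree text/tail model (as the node.text and node.tail names suggest) and uses the standard library's xml.etree.ElementTree as a stand-in. The <p> wrapper and both function names are hypothetical and are not taken from Fonduer's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical <p> wrapper: it only gives the fragment a single root element.
root = ET.fromstring(
    "<p><em>italics and later <strong>bold</strong></em>. Even</p>"
)


def text_and_tail_first(node):
    """Order reported in this issue: text and tail before the children."""
    out = []
    if node.text:
        out.append(node.text)
    if node.tail:
        out.append(node.tail)
    for child in node:
        out.extend(text_and_tail_first(child))
    return out


def document_order(node):
    """Expected order: text, then every child subtree, then the tail."""
    out = []
    if node.text:
        out.append(node.text)
    for child in node:
        out.extend(document_order(child))
    if node.tail:
        out.append(node.tail)
    return out


em = root[0]  # the <em> element
buggy = text_and_tail_first(em)
expected = document_order(em)
print(buggy)     # ['italics and later ', '. Even', 'bold']
print(expected)  # ['italics and later ', 'bold', '. Even']
```

Emitting the tail only after recursing into the children is what restores the reading order. Whether the actual fix belongs in a recursive visitor like this or in the parser's own traversal is left open here.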
