Issues with xpath processing along the "FullText" path template recognition. #780

@krstp

Here is a sample HTML code:

<div class="FulltextWrapper">
    ...
</div>    

Trafilatura fails to process it correctly.

The root cause appears to be the overly general XPath rule used for "Fulltext" pattern recognition, which also matches FulltextWrapper:

contains(translate(@class, "FULTEX","fultex"), "fulltext")
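
To illustrate why this rule also catches FulltextWrapper, here is a minimal pure-Python sketch that mimics the XPath predicate with plain string operations. It does not call Trafilatura or lxml. The function names are made up for illustration, and token_match is only a hypothetical stricter alternative, not something the library provides:

```python
# Pure-Python emulation of the XPath predicate quoted above.
# XPath translate(s, "FULTEX", "fultex") lowercases only those six letters.
_LOWER_FULTEX = str.maketrans("FULTEX", "fultex")


def broad_match(class_attr: str) -> bool:
    """Mimic contains(translate(@class, "FULTEX", "fultex"), "fulltext")."""
    return "fulltext" in class_attr.translate(_LOWER_FULTEX)


def token_match(class_attr: str) -> bool:
    """Hypothetical stricter check: a whole class token must equal "fulltext"."""
    return any(token.lower() == "fulltext" for token in class_attr.split())


# Compare both checks on a few class attribute values.
for value in ("FulltextWrapper", "fulltext", "article FullText", "sidebar"):
    print(f"{value!r}: broad={broad_match(value)}, token={token_match(value)}")
```

The substring check matches any class name that merely contains "fulltext", while the token check only matches a class that is exactly "fulltext" (ignoring case). Whether a stricter check like this would miss legitimate full-text containers on other sites is an open question.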

I'm not sure what percentage of websites are affected, but it seems to me this could be handled in a more fail-safe way.

The particular call that results in the error is: trafilatura.bare_extraction(html_content)

Would be great to see those cases handled. Thanks for all the great work!

Labels: bug (Something isn't working)