Open
Labels
bug: Something isn't working
Description
Here is a sample HTML code:
<div class="FulltextWrapper">
...
</div>
Trafilatura fails to process this markup correctly.
The root cause appears to be the FulltextWrapper class name: the XPath rule that recognizes the Fulltext pattern is too general and matches it. The rule in question:
trafilatura/trafilatura/xpaths.py
Line 45 in 42ada5a
| contains(translate(@class, "FULTEX","fultex"), "fulltext") |
I am not sure what percentage of websites are affected, but it seems to me this could be done in a more fail-safe way.
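To illustrate why the rule matches, here is a minimal sketch in plain Python that emulates the XPath expression quoted above. It is not Trafilatura's code: the helper names (`xpath_translate`, `matches_substring`, `matches_token`) are hypothetical, and the token-based check is only one possible stricter alternative, assumed here for comparison.

```python
# Emulates contains(translate(@class, "FULTEX", "fultex"), "fulltext")
# with the standard library only. All function names are hypothetical.

def xpath_translate(value: str, src: str, dst: str) -> str:
    """Mimic XPath 1.0 translate() for the equal-length, no-duplicates case."""
    return value.translate(str.maketrans(src, dst))

def matches_substring(class_attr: str) -> bool:
    """Substring test, as in the quoted rule: any class containing 'fulltext'."""
    lowered = xpath_translate(class_attr, "FULTEX", "fultex")
    return "fulltext" in lowered

def matches_token(class_attr: str) -> bool:
    """Assumed stricter variant: a whole class token must equal 'fulltext'."""
    return any(token.lower() == "fulltext" for token in class_attr.split())

if __name__ == "__main__":
    for cls in ("FulltextWrapper", "Fulltext", "article fulltext"):
        print(cls, matches_substring(cls), matches_token(cls))
```

With the substring test, `FulltextWrapper` is selected along with `Fulltext`; the token-based variant only accepts a class token that is exactly `fulltext`. Whether such a stricter rule would miss legitimate content containers on other sites is not something this sketch can answer.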
The particular call that results in the error is: trafilatura.bare_extraction(html_content)
Would be great to see those cases handled. Thanks for all the great work!