Issues with xpath processing along the "FullText" path template recognition.

Here is a sample HTML snippet:

```
<div class="FulltextWrapper">
    ...
</div>    
```

Trafilatura fails to process it correctly.

The root issue appears to be the handling of `FulltextWrapper`: the `Fulltext` pattern seems to be recognized by an XPath rule that is too general, such as:
https://github.com/adbar/trafilatura/blob/42ada5a515132b3fa7f0b2fbbb8ea5b1e05f4e50/trafilatura/xpaths.py#L45

I am not sure what percentage of websites is affected, but it seems to me this could be done in a more fail-safe way.
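
To illustrate what I mean by "too general", here is a minimal stdlib-only sketch. It is not Trafilatura's code, and I have not verified exactly what the linked rule does; it assumes the rule amounts to a case-insensitive substring test on the `class` attribute. The helper names (`ClassCollector`, `substring_match`, `token_match`) and the second `<div>` in the sample are made up for the comparison.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect the class attribute of every element in the document."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.append(value)


def substring_match(class_attr, needle="fulltext"):
    # Assumed broad rule: case-insensitive substring test on the whole attribute.
    return needle in class_attr.lower()


def token_match(class_attr, needle="fulltext"):
    # Hypothetical stricter rule: a whitespace-separated class token must equal the needle.
    return needle in class_attr.lower().split()


# The first div mirrors the sample above; the second is invented for contrast.
sample = (
    '<div class="FulltextWrapper"><p>text</p></div>'
    '<div class="fulltext"><p>body</p></div>'
)
collector = ClassCollector()
collector.feed(sample)

for value in collector.classes:
    print(value, substring_match(value), token_match(value))
```

Under that assumption, the substring test accepts both `FulltextWrapper` and `fulltext`, while the token test accepts only the latter. A stricter rule might of course miss legitimate content on other sites, so this is only meant to show the trade-off, not to propose a specific fix.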

The particular call that results in the error is: `trafilatura.bare_extraction(html_content)`

Would be great to see those cases handled. Thanks for all the great work!

Issues with xpath processing along the "FullText" path template recognition. #780

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Issues with xpath processing along the "FullText" path template recognition. #780

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions