RegexMatchSpan with sep="" concatenates words with sep="(space)" #270

@HiromuHota

Describe the bug
The sentence "123 456 789" is parsed into three words: "123", "456", and "789".
I'd like to match them as a single nine-digit number with

RegexMatchSpan(rgx=r"\d{9}", sep="")

but sep="" has no effect: the words are still joined with a space, so the pattern never matches.

To Reproduce
Steps to reproduce the behavior:

  1. Have a sentence "123 456 789"
  2. Parse it
  3. Try to match it with RegexMatchSpan(rgx=r"\d{9}", sep="")

Expected behavior
RegexMatchSpan(rgx=r"\d{9}", sep="") should match the sentence "123 456 789".

Environment:

  • Fonduer Version: 0.6.2

Additional context
I think the root cause of this issue is the following implementation.

def get_attrib_span(self, a, sep=" "):
    """Get the span of sentence attribute *a*.

    Intuitively, like calling::

        sep.join(span.a)

    :param a: The attribute to get a span for.
    :type a: str
    :param sep: The separator to use for the join.
    :type sep: str
    :return: The joined tokens, or text if a="words".
    :rtype: str
    """
    # NOTE: Special behavior for words currently (due to correspondence
    # with char_offsets)
    if a == "words":
        return self.sentence.text[self.char_start : self.char_end + 1]
    else:
        return sep.join(self.get_attrib_tokens(a))

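where a is "words" by default. In that branch sep is never used: the span is sliced directly from the sentence text, so the spaces in "123 456 789" are preserved.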
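
To illustrate the suspected cause without a Fonduer installation, here is a minimal sketch. FakeSentence, FakeSpan, and get_attrib_span_joined are hypothetical stand-ins written for this example, not Fonduer classes or methods; only the branching inside get_attrib_span is copied from the code quoted above.

```python
import re


class FakeSentence:
    """Hypothetical stand-in for a parsed sentence (not Fonduer's class)."""

    def __init__(self, text):
        self.text = text
        self.words = text.split()


class FakeSpan:
    """Hypothetical stand-in for a span covering a whole sentence."""

    def __init__(self, sentence):
        self.sentence = sentence
        self.char_start = 0
        self.char_end = len(sentence.text) - 1

    def get_attrib_tokens(self, a):
        return getattr(self.sentence, a)

    def get_attrib_span(self, a, sep=" "):
        # Same branching as the method quoted in this issue.
        if a == "words":
            return self.sentence.text[self.char_start : self.char_end + 1]
        else:
            return sep.join(self.get_attrib_tokens(a))

    def get_attrib_span_joined(self, a, sep=" "):
        # Hypothetical alternative: always join the tokens with sep.
        return sep.join(self.get_attrib_tokens(a))


span = FakeSpan(FakeSentence("123 456 789"))

current = span.get_attrib_span("words", sep="")
joined = span.get_attrib_span_joined("words", sep="")

matches_current = re.fullmatch(r"\d{9}", current) is not None
matches_joined = re.fullmatch(r"\d{9}", joined) is not None

print(repr(current), matches_current)  # '123 456 789' False
print(repr(joined), matches_joined)    # '123456789' True
```

Always joining the tokens is only one possible direction. The NOTE in the original method says the special case exists to keep the text aligned with char_offsets, so an actual fix would probably need to preserve that correspondence.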

Metadata

Labels: bug (Something isn't working)
