Proposal: Offset based Token Classification utilities  #7019

@talolard

🚀 Feature request

Hi. So we work a lot with span annotations on text that isn't tokenized, and we want a "canonical" way to work with that. I have some ideas and rough implementations, so I'm looking for feedback on whether this belongs in the library, and whether the proposed implementation is more or less right.

I also think there is a good chance that everything I want already exists, and the only thing needed is slightly clearer documentation. I hope that's the case, and I'd be happy to write that documentation if someone can point me in the right direction.

The Desired Capabilities

What I'd like is a canonical way to:

  • Tokenize the examples in the dataset
  • Align my annotations with the output tokens (see notes below)
  • Have the tokens and labels correctly padded to the max length of an example in the batch or max_sequence_length
  • Have a convenient function that returns predicted offsets (see the sketch after this list)
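
To make the last bullet concrete, here's a rough sketch of what "returns predicted offsets" could look like. This is purely hypothetical (predictions_to_spans is my name, not an existing transformers API); it assumes IOB/BIOES-style tag strings and the tokenizer's offset_mapping for the same tokens:

def predictions_to_spans(tags, offsets):
    # Collapse token-level tags back into character-level spans.
    spans = []
    current = None  # the span currently being built, if any
    for tag, (start, end) in zip(tags, offsets):
        # 'O' tokens and special tokens (which map to (0, 0)) end any open span.
        if tag == 'O' or start == end:
            if current is not None:
                spans.append(current)
                current = None
            continue
        prefix, entity = tag.split('-', 1)
        if prefix in ('B', 'S') or current is None or current['tag'] != entity:
            # A new span starts; flush the previous one.
            if current is not None:
                spans.append(current)
            current = dict(start=start, end=end, tag=entity)
        else:
            # An I/E continuation extends the open span to this token's end.
            current['end'] = end
    if current is not None:
        spans.append(current)
    return spans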

Some Nice To Haves

  • It would be nice if such a utility internally handled tagging schemes like IOB or BIOES, and optionally exposed them in the output or "folded" them back to the core entities (sketched after this list).
  • It would be nice if there were a recommended/default strategy for handling examples that are longer than max_sequence_length.
  • It would be amazing if we could pass labels to the tokenizer and have the alignment happen in Rust (in parallel). But I don't know Rust, and I have a sense this is complicated, so I won't be taking that on myself; I'm assuming this happens in Python.
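
On the folding point in the first bullet: for the simple schemes, folding could be as little as stripping the scheme prefix. A minimal sketch (fold_tags is a made-up name, not an existing API):

def fold_tags(tags):
    # Fold scheme-prefixed tags (B-/I-/E-/S-) back to the bare entity type.
    return [t if t == 'O' else t.split('-', 1)[1] for t in tags]

fold_tags(['O', 'B-PER', 'I-PER', 'E-PER', 'O'])  # ['O', 'PER', 'PER', 'PER', 'O']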

Current State and what I'm missing

  • The docs and examples for Token Classification assume that the text is pre-tokenized.
  • For a word that has a label and is tokenized into multiple tokens, the recommendation is to place the label on the first token and "ignore" the following tokens (a sketch of that convention follows this list).
  • The example pads all examples to max_sequence_length, which is a big performance hit (as opposed to bucketing by length and padding dynamically).
  • The example loads the entire dataset into memory at once. I'm not sure if this is a real problem or if I'm being nitpicky, but I think "the right way" would be to lazily load a batch or a few batches at a time.

Alignment

The path to aligning tokens with span annotations is the return_offsets_mapping flag on the tokenizer (which is awesome!).
There are probably a few strategies; the one I've been using is logic like this:

def align_tokens_to_annos(offsets, annos):
    anno_ix = 0
    results = []
    done = len(annos) == 0
    for offset in offsets:
        if done:
            # No annotations left: everything from here on is outside.
            results.append(dict(offset=offset, tag='O'))
            continue
        anno = annos[anno_ix]
        start, end = offset
        # Offsets from return_offsets_mapping are half-open: [start, end).
        if end <= anno['start']:
            # The token ends before the next annotation starts.
            results.append(dict(offset=offset, tag='O'))
        elif start <= anno['start'] and end >= anno['end']:
            # The token covers the whole annotation: a single-token span.
            anno_ix += 1
            results.append(dict(offset=offset, tag=f'B-{anno["tag"]}'))
        elif start <= anno['start']:
            # The token opens the annotation.
            results.append(dict(offset=offset, tag=f'B-{anno["tag"]}'))
        elif end < anno['end']:
            # The token is strictly inside the annotation.
            results.append(dict(offset=offset, tag=f'I-{anno["tag"]}'))
        elif start < anno['end']:
            # The token closes the annotation; move on to the next one.
            anno_ix += 1
            results.append(dict(offset=offset, tag=f'E-{anno["tag"]}'))
        else:
            raise ValueError(f"Funny overlap: {offset}, {anno}")
        if anno_ix >= len(annos):
            done = True
    return results
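
For example, with made-up offsets for the text "Tal Perry codes" and a single PER annotation covering "Tal Perry":

offsets = [(0, 3), (4, 9), (10, 15)]
annos = [dict(start=0, end=9, tag='PER')]
align_tokens_to_annos(offsets, annos)
# -> [{'offset': (0, 3), 'tag': 'B-PER'},
#     {'offset': (4, 9), 'tag': 'E-PER'},
#     {'offset': (10, 15), 'tag': 'O'}]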

And then I call that function inside add_labels here:

res_batch = tokenizer(
    [s['text'] for s in pre_batch],
    return_offsets_mapping=True,
    padding=True,
)
offsets_batch = res_batch.pop('offset_mapping')
res_batch['labels'] = []
for i in range(len(offsets_batch)):
    labels = add_labels(res_batch['input_ids'][i], offsets_batch[i],
                        pre_batch[i]['annotations'])
    res_batch['labels'].append(labels)

This works, and it's nice because padding goes to the longest sentence in the batch, so bucketing by length gives a big boost. But the add_labels step runs in Python, and is thus sequential over the examples and not super fast. I haven't measured this to confirm it's a problem; I'm just bringing it up.
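
For clarity, "bucketing" here just means grouping examples of similar length into the same batch, so that padding=True only pads each batch to its own longest member. A minimal sketch (the helper and its names are illustrative, not an existing API):

def bucket_batches(examples, batch_size):
    # Sort by text length so each batch only pads to its own longest example.
    ordered = sorted(examples, key=lambda s: len(s['text']))
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]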

Desired Solution

I need most of this stuff, so I'm going to make it either way; the open question is where it should live.

The current "NER" examples and issues assume that the text is pre-tokenized. Our use case is one where the full text is not tokenized and the labels for "NER" come as character offsets. I propose a utility/example to handle that scenario, because I haven't been able to find one.

In practice, most of what I'm proposing doesn't need any modification to transformers itself, and doing it in Rust is beyond me, so this might boil down to a utility class and documentation.

Motivation

I make text annotation tools, and our output is span annotations on untokenized text. I want our users to be able to easily use transformers. From my (limited) experience, I suspect that in many non-academic use cases span annotations on untokenized text are the norm, and that others would benefit from this as well.

Possible ways to address this

I can imagine a few scenarios here:

  • **This is out of scope.** Maybe this isn't something that should be handled by transformers at all, and it should be delegated to a library and a blog post.
  • **This is in scope and just needs documentation**, i.e. all the things I mentioned are things transformers should and can already do. In that case the solution would be pointing someone (me) to the right functions and adding some documentation.
  • **This is in scope and should be a set of utilities.** Solving this could be as simple as making a file similar to utils_ner.py. I think that would be the simplest way to get something usable, gather feedback, and see if anyone else cares.
  • **This is in scope but should be done in Rust soon.** If we want to be performance purists, it would make sense to handle the alignment of span-based labels in Rust. I don't know Rust, so I can't help much, and I don't know whether there is appetite or capacity from someone who does, or whether it's worth the (presumably) additional effort.

Your contribution

I'd be happy to implement and submit a PR, or make an external library or add to a relevant existing one.
