Object too large error in preprocessing script #102

@ahalterman

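I've been getting a "bytes object is too large" error when processing a large-ish number of documents (267,103 in the run below) with the 01_parse.py script. The error is raised only after all documents have been parsed, when the single doc_bin is serialized. Splitting the output into several smaller doc_bin objects resolves the issue. Full error: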

ahalt@xxxxxxxx:~/sense2vec$ python sense2vec/scripts/01_parse.py hindu_complete.txt docbins en_core_web_sm -n 10
ℹ Using spaCy model en_core_web_sm
Preprocessing text...
Docs: 267103 [1:00:38, 73.42/s]
✔ Processed 267103 docs
Traceback (most recent call last):
  File "sense2vec/scripts/01_parse.py", line 47, in <module>
    plac.call(main)
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "sense2vec/scripts/01_parse.py", line 39, in main
    doc_bin_bytes = doc_bin.to_bytes()
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/spacy/tokens/_serialize.py", line 151, in to_bytes
    return zlib.compress(srsly.msgpack_dumps(msg))
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/srsly/_msgpack_api.py", line 16, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/srsly/msgpack/__init__.py", line 40, in packb
    return Packer(**kwargs).pack(o)
  File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 235, in srsly.msgpack._packer.Packer._pack
  File "_packer.pyx", line 206, in srsly.msgpack._packer.Packer._pack
ValueError: bytes object is too large
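
For anyone hitting the same thing, here is a minimal sketch of the "several smaller doc_bin objects" workaround. It is not the script's actual code: `chunked`, `write_shards`, the `shard-00000.bin` naming, and the shard size are all hypothetical, and the serializer is a plain-text stand-in so the example runs without spaCy. In 01_parse.py the serializer would presumably build one doc_bin per batch and return its `to_bytes()` output. As far as I can tell, the error comes from msgpack's limit of 2**32 - 1 bytes for a single bytes object, so each shard only needs to stay below that.

```python
import tempfile
from itertools import islice
from pathlib import Path
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def write_shards(
    items: Iterable[T],
    serialize: Callable[[List[T]], bytes],
    out_dir: str,
    shard_size: int,
    stem: str = "shard",
) -> List[Path]:
    """Serialize each batch on its own and write one file per batch.

    `serialize` is an assumption: any callable that turns a list of items
    into bytes. With spaCy it would wrap a per-batch doc_bin and to_bytes().
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, batch in enumerate(chunked(items, shard_size)):
        path = target / f"{stem}-{index:05d}.bin"
        path.write_bytes(serialize(batch))
        paths.append(path)
    return paths


# Stand-in serializer so the sketch runs without spaCy installed.
def join_lines(batch: List[str]) -> bytes:
    return "\n".join(batch).encode("utf8")


with tempfile.TemporaryDirectory() as tmp_dir:
    texts = ["first doc", "second doc", "third doc"]
    shard_paths = write_shards(texts, join_lines, tmp_dir, shard_size=2)
    shard_names = [p.name for p in shard_paths]

print(shard_names)  # ['shard-00000.bin', 'shard-00001.bin']
```

Because only one batch is held and serialized at a time, no single serialized object grows with the full corpus. The right shard size depends on document length, so it would need some trial and error.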

Labels: bug, enhancement
