BadZipfile: File is not a zip file #247

@dsal1951

I'm trying to pull bill text for previous congresses for some data science research I'm working on. When I run the following command:

./run govinfo --collections=BILLS --extract=mods,text,xml,pdf --congress=114

I get a BadZipfile error on every bill (example for s29 below). If I try to open package.zip manually, I end up in the never-ending zip -> cpgz -> zip cycle.

The strange thing is that if I delete all of the text-versions subdirectories and then rerun the same command, it works fine for the vast majority of bills (~90%). I haven't been able to find any pattern in this behavior, but I can confirm that I've observed it on both macOS and Ubuntu, and across many congresses.
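A minimal sketch for narrowing this down, assuming the bad files are something other than zip data (for example an HTML error page or a truncated download; that is a guess, not something confirmed here). It walks a directory tree, finds every package.zip that `zipfile.is_zipfile` rejects, and returns the first bytes of each so the actual content can be inspected. The function name `find_bad_packages` and the `data` directory are hypothetical and not part of the congress project.

```python
import os
import zipfile


def find_bad_packages(root, filename="package.zip"):
    """Return (path, first_bytes) for each `filename` under `root` that is not a valid zip."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            path = os.path.join(dirpath, filename)
            # is_zipfile checks for the zip end-of-central-directory record.
            if not zipfile.is_zipfile(path):
                with open(path, "rb") as f:
                    head = f.read(64)  # enough to tell HTML or an empty file from zip data
                bad.append((path, head))
    return bad


if __name__ == "__main__":
    # "data" is an assumed output directory; adjust to wherever the scraper writes.
    for path, head in find_bad_packages("data"):
        print(path, repr(head))
```

A valid zip archive normally starts with the bytes `PK`; if the reported files start with `<` or are empty, the download itself is the likely culprit rather than the extraction step.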

Error fetching package 114s29is in collection BILLS from https://www.govinfo.gov/app/details/BILLS-114s29is.
Traceback (most recent call last):
  File "/Users/trent/Documents/congress/tasks/govinfo.py", line 174, in update_sitemap2
    mirror_results = mirror_package(collection, package_name, lastmod, lastmod_cache.setdefault("packages", {}), options)
  File "/Users/trent/Documents/congress/tasks/govinfo.py", line 313, in mirror_package
    extracted_files = extract_package_files(collection, package_name, file_path, lastmod_cache, options)
  File "/Users/trent/Documents/congress/tasks/govinfo.py", line 371, in extract_package_files
    with zipfile.ZipFile(package_file) as package:
  File "/anaconda2/envs/congress2/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/anaconda2/envs/congress2/lib/python2.7/zipfile.py", line 811, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file
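
For reference, the failing call in the traceback is `zipfile.ZipFile(package_file)`. A hedged sketch of how such a call could be guarded so that one bad archive is reported and skipped instead of raising; `open_package_or_none` is a hypothetical helper for illustration, not the project's code.

```python
import zipfile


def open_package_or_none(package_file):
    """Return an open ZipFile, or None if the file is not a readable zip archive."""
    try:
        return zipfile.ZipFile(package_file)
    except (zipfile.BadZipfile, IOError) as exc:
        # BadZipfile is the Python 2.7 name seen in the traceback; Python 3
        # keeps it as an alias of BadZipFile.
        print("Skipping %s: %s" % (package_file, exc))
        return None
```

The caller would check for `None` and could delete or re-download the file before retrying.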
