Description
I'm trying to pull bill text for previous congresses for some data science research I'm working on. When I run the following command:
./run govinfo --collections=BILLS --extract=mods,text,xml,pdf --congress=114
I get a BadZipfile error on every bill (example for s29 below). If I try to open package.zip manually, I end up in the never-ending zip -> cpgz -> zip cycle.
The strange thing is that if I delete all of the text-versions subdirectories and then rerun the same command, it works fine for the vast majority of bills (~90%). I haven't been able to find any rhyme or reason to this behavior, but I can confirm that I've observed it on both macOS and Ubuntu, and across many congresses.
Error fetching package 114s29is in collection BILLS from https://www.govinfo.gov/app/details/BILLS-114s29is.
Traceback (most recent call last):
  File "/Users/trent/Documents/congress/tasks/govinfo.py", line 174, in update_sitemap2
    mirror_results = mirror_package(collection, package_name, lastmod, lastmod_cache.setdefault("packages", {}), options)
  File "/Users/trent/Documents/congress/tasks/govinfo.py", line 313, in mirror_package
    extracted_files = extract_package_files(collection, package_name, file_path, lastmod_cache, options)
  File "/Users/trent/Documents/congress/tasks/govinfo.py", line 371, in extract_package_files
    with zipfile.ZipFile(package_file) as package:
  File "/anaconda2/envs/congress2/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/anaconda2/envs/congress2/lib/python2.7/zipfile.py", line 811, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file
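In case it helps narrow this down, here is a minimal sketch for checking what a failing package.zip actually contains. The helper name `describe_download` is hypothetical and not part of the congress scraper, and the HTML check is only a guess at one possible cause (for example, an error page saved in place of the archive); I have not confirmed that this is what is happening here.

```python
import zipfile


def describe_download(path):
    """Return a rough label for the contents of the file at `path`.

    Hypothetical diagnostic helper, not part of the congress project.
    """
    # zipfile.is_zipfile() looks for the zip end-of-central-directory record.
    if zipfile.is_zipfile(path):
        return "zip"

    # Not a zip: peek at the first bytes to see what was saved instead.
    with open(path, "rb") as f:
        head = f.read(512).lstrip().lower()

    if not head:
        return "empty"
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        # Assumption: an HTML page was written where the archive should be.
        return "html"
    return "unknown"
```

Calling `describe_download` on the path to one of the failing package.zip files should show whether the download is empty, an HTML page, or something else, which may explain why `zipfile.ZipFile` rejects it.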