The Import Pipeline
Watch this video about how the Open Library import pipeline works. Staff should also see these import notes.
OpenLibrary.org offers several "Public Import API Endpoints" that can be used to submit book data for import: one for MARC records, one for raw JSON book records (/api/import), and one for importing directly against existing partner items (e.g. archive.org items) by ID (/api/import/ia).
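For example, a raw JSON record might be submitted to /api/import along these lines (a minimal sketch: the exact required fields and the authentication details are assumptions to be checked against the validator described below):

```python
import json
import requests

# A minimal book record for the raw JSON endpoint; title and
# source_records are believed to be required, other fields optional.
record = {
    "title": "A Short History of Nearly Everything",
    "authors": [{"name": "Bill Bryson"}],
    "publish_date": "2003",
    "source_records": ["partner:example-id-123"],
}

# The endpoint requires an authenticated (privileged) session; the
# cookie-based auth shown here is illustrative only.
resp = requests.post(
    "https://openlibrary.org/api/import",
    data=json.dumps(record),
    cookies={"session": "<your-session-cookie>"},
)
print(resp.status_code, resp.text)
```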
Outside of these public API endpoints, Open Library also maintains a bulk batch import system for enqueueing JSON book data in bulk from book sources like betterworldbooks, amazon, and other trusted book providers (like librivox and standardebooks). These bulk batch imports ultimately submit records (in a systematic and rate-limited way) to the "Public Import API Endpoints", e.g. /api/import.
Once a record passes through our bulk batch import process and/or gets submitted to one of our "Public Import API Endpoints" (e.g. /api/import, see code), the data is then parsed, augmented, and validated by the "Validator" in importapi/import_edition_builder.py.
Next, the formatted, validated book_edition goes through the "Import Processor", called as catalog.add_book.load(book_edition). The function has three paths: it tries to find an existing matching edition and its work, and then either (1) no edition/work is found and the edition is created, (2) a matched edition is found and no new data is available, or (3) a matched record is modified with newly available data.
In the case of (1) and (3), a final step called "Perform Import / Update" is performed, as described in load_data(). Here is a flowchart of what the internal import pipeline looks like for a record that has been submitted to a public API:

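As a rough sketch of those three load() paths (the reply shape shown here is an assumption; load() in openlibrary/catalog/add_book/__init__.py is authoritative):

```python
from openlibrary.catalog.add_book import load

# A validated edition dict, as produced by import_edition_builder (illustrative).
book_edition = {
    "title": "Example Title",
    "authors": [{"name": "Jane Author"}],
    "source_records": ["ia:example0000item"],
}

reply = load(book_edition)

# The reply indicates which path was taken; the keys below are
# illustrative rather than a guaranteed contract:
# (1) a new edition (and possibly work) was created,
# (2) a matching edition was found and left unchanged, or
# (3) a matching edition was modified with new data.
if reply.get("success"):
    print(reply.get("edition"), reply.get("work"))
```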
For instructions and context on testing the Cron + ImportBot pipelines, please see notes in issue #5588 and this overview video (bharat + mek)
Open Library's production automatic import pipeline consists of two components:
- A Cron service with a collection of jobs which routinely pull data from partner sources and enqueue them in a database
- An ImportBot which polls this unified database of queued import requests and processes the imports into the catalog
Note: In the following chart, the Infogami Container is detailed above in the main import flowchart
The Import Queue System is a pipeline for staging and coordinating the import of book records into Open Library's catalog. Similar to GitHub's Merge Request Queue, it provides a mechanism to queue, deduplicate, and process book imports systematically.
The system uses two PostgreSQL tables:
- import_batch (schema)
  - id: Primary key
  - name: Unique batch identifier (e.g., "batch-{hash}", "new-scans-202401")
  - submitter: Username of the batch creator
  - submit_time: Timestamp of batch creation
- import_item (schema)
  - id: Primary key
  - batch_id: Foreign key to import_batch
  - added_time: When the item was added to the queue
  - import_time: When the item was actually imported
  - status: Item state (see Status Lifecycle below)
  - error: Error code if the import failed
  - ia_id: Unique identifier (e.g., "isbn:1234567890", "amazon:B000...", or an archive.org OCAID)
  - data: JSON of the book record to import
  - ol_key: Resulting Open Library key after import (e.g., "/books/OL123M")
  - comments: Optional comments
  - submitter: Username of the item submitter
  - A unique constraint on (batch_id, ia_id) prevents duplicates within a batch
These tables are managed by the import database model/API (imports.py).
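For instance, a partner pipeline might stage records roughly like this (a sketch assuming the Batch helper in openlibrary/core/imports.py; names and signatures should be verified against that file):

```python
from openlibrary.core.imports import Batch

# Create a named batch (a row in import_batch) and stage items into
# import_item; the (batch_id, ia_id) constraint deduplicates for us.
batch = Batch.new("bwb-2024-01")  # Batch.new() is assumed from imports.py
batch.add_items([
    {
        "ia_id": "isbn:9780140328721",
        "data": {"title": "Matilda", "source_records": ["bwb:9780140328721"]},
    },
])
```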
Import items that are staged within the Import Queue tables progress through the following statuses (imports.py):
- pending: Ready for import processing
- needs_review: Requires admin approval (for non-admin submissions)
- staged: Pre-import state used by partner imports
- processing: Currently being imported (a temporary state to prevent race conditions)
- created: Successfully imported as a new book
- modified: Successfully imported, updating an existing book
- found: Book already exists, no import needed
- failed: Import failed with an error code
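One way to picture the lifecycle is as a set of allowed transitions (an illustrative sketch; the authoritative logic lives in imports.py and ImportBot):

```python
# Pre-import states flow toward one of four terminal outcomes.
TRANSITIONS = {
    "needs_review": ["pending"],        # an admin approves the batch
    "staged": ["pending"],              # a partner import is promoted
    "pending": ["processing"],          # ImportBot claims the item
    "processing": ["created", "modified", "found", "failed"],
}
```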
On ol-home0, a script called manage-imports.py (colloquially called ImportBot, as it runs as the Open Library user "ImportBot") runs in a dedicated Docker container using the import-all directive:
- import-all (line 153):
  - Runs continuously as a cron job
  - Uses multiprocessing (8 workers)
  - Fetches pending items once a minute in batches of 1000
  - Calls Open Library's import API for each item
  - Updates the status in the Import Queue tables based on the import result
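In outline, that loop behaves something like the following (a simplified sketch, not the actual code in manage-imports.py):

```python
import time
from multiprocessing import Pool

def import_one(item):
    """Call Open Library's import API for one queued item and return
    the resulting status (created/modified/found/failed)."""
    ...  # POST the item's data to the import API, inspect the reply

def import_all(fetch_pending, update_status, workers=8, batch_size=1000):
    """Continuously fetch pending items once a minute in batches of
    1000 and import them with 8 worker processes."""
    with Pool(workers) as pool:
        while True:
            items = fetch_pending(limit=batch_size)
            for item, status in zip(items, pool.map(import_one, items)):
                update_status(item, status)  # record result in import_item
            time.sleep(60)
```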
The Import Queue web interface allows logged-in patrons to submit batches of import items in JSONL format. Its REST endpoints are defined as follows:
- /import/batch/new (batch_imports class)
  - POST: Submit a JSONL file or text for batch import
  - Validates user authentication
  - Sets status to "pending" for admins, "needs_review" for others
  - Calls batch_import() to process the submission
- /import/batch/{id} (BatchImportView)
  - GET: View batch status with pagination
  - Shows import progress and item statuses
- /import/batch/approve/{id} (BatchImportApprove)
  - GET: Admin-only endpoint to approve batches
  - Changes items from "needs_review" to "pending"
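For instance, a JSONL submission might be built like this (a sketch: one JSON record per line; the form-field name "raw_data" and the cookie auth are hypothetical):

```python
import json
import requests

records = [
    {"title": "Example One", "source_records": ["partner:ex-1"]},
    {"title": "Example Two", "source_records": ["partner:ex-2"]},
]

# JSONL: one JSON-encoded book record per line.
jsonl = "\n".join(json.dumps(r) for r in records)

# Requires a logged-in session; "raw_data" is a hypothetical field name.
requests.post(
    "https://openlibrary.org/import/batch/new",
    data={"raw_data": jsonl},
    cookies={"session": "<your-session-cookie>"},
)
```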
There are multiple paths by which data can be imported into Open Library:
- Through the website UI and the Open Library Client, which both use the endpoint https://openlibrary.org/books/add
  - code: openlibrary/plugins/upstream/addbook.py
  - tests: openlibrary/plugins/upstream/tests/test_addbook.py, although the current tests only cover the TestSaveBookHelper class, which is only used by the edit book pages, not addbook
- Through the data import API: https://openlibrary.org/api/import
  - code: openlibrary/plugins/importapi/code.py, which calls openlibrary.catalog.add_book.load() in openlibrary/catalog/add_book/__init__.py. Checking for existing works and editions is performed in openlibrary.catalog.add_book.exit_early()
  - Add book tests: openlibrary/catalog/add_book/test_add_book.py
- By reference to archive.org items via the IA import endpoint: https://openlibrary.org/api/import/ia
  - code: openlibrary/plugins/importapi/code.py, which calls openlibrary.catalog.add_book.load()
- Through our privileged ImportBot scripts/manage_imports.py, which POSTs to the IA import API via Openlibrary.import_ocaid() from openlibrary/api.py (a sketch follows this list)
- Through the bulk import API in openlibrary/api.py -- this should be considered deprecated
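The ImportBot path, for reference, looks roughly like this (a sketch assuming the OpenLibrary client class in openlibrary/api.py; check that file for the real constructor and signatures):

```python
from openlibrary.api import OpenLibrary

# Authenticate as a privileged bot account, then ask the IA import
# endpoint to import a single archive.org item by its OCAID.
ol = OpenLibrary(base_url="https://openlibrary.org")  # constructor args assumed
ol.login("ImportBot", "<password>")                   # credentials assumed
ol.import_ocaid("itinerariosporlo0000garc")           # POSTs to /api/import/ia
```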
From openlibrary_cron-jobs_1 on ol-home0, enqueue a batch:

```
cd /openlibrary/scripts
PYTHONPATH=/openlibrary python /openlibrary/scripts/manage-imports.py --config /olsystem/etc/openlibrary.yml add-new-scans 2021-07-28
```
Run an import on an ID from openlibrary_importbot_1 on ol-home0:

```
cd /openlibrary/scripts
PYTHONPATH=/openlibrary python
```

```python
import web
import infogami
from openlibrary import config

# Load production configuration so the session can talk to the database.
config.load_config('/olsystem/etc/openlibrary.yml')
config.setup_infobase_config('/olsystem/etc/infobase.yml')

# The script name contains a dash, so it is imported via __import__.
importer = __import__("manage-imports")
import internetarchive as ia

# Look up a queued import item by its archive.org identifier and import it.
item = importer.ImportItem.find_by_identifier('itinerariosporlo0000garc')
x = importer.ol_import_request(item, servername='https://openlibrary.org', require_marc=False)
```