crawl-first

Deterministic biosample enrichment for LLM-ready data preparation

Overview

crawl-first systematically follows discoverable links from NMDC biosample records to gather environmental, geospatial, weather, publication, and ontological data. This deterministic approach enables comprehensive data enrichment before downstream LLM analysis.

Philosophy

Instead of letting LLMs make API calls or guess at missing data, crawl-first embodies the principle: gather first, analyze second. This ensures reproducible, comprehensive datasets for AI analysis.

Features

Biosample enrichment: Follows discoverable linked data sources
Geospatial analysis: Coordinates, elevation, land cover, soil types
Weather integration: Historical weather data for sample collection dates
Publication tracking: DOI resolution, full-text retrieval when available
Ontology enrichment: ENVO term matching for environmental descriptors
Quality validation: Distance/elevation comparisons between data sources
Interactive maps: Generated URLs for coordinate validation
Comprehensive caching: Prevents redundant API calls

System Requirements

The following system utilities are required for development and testing:

curl - API requests and data fetching
jq - JSON processing in Makefile targets
shuf - Random sampling of biosample IDs
head - Data sampling utilities
Standard Unix utilities: mkdir, rm, find, wc

On macOS, these are typically pre-installed. On Ubuntu/Debian:

sudo apt update && sudo apt install curl jq coreutils

Installation

uv add crawl-first

Usage

Single biosample

uv run crawl-first --biosample-id nmdc:bsm-11-abc123 --email [email protected] --output-file result.yaml

Multiple biosamples

uv run crawl-first --input-file biosample_ids.txt --email [email protected] --output-dir results/

Sample from large dataset

uv run crawl-first --input-file all_biosamples.txt --sample-size 50 --email [email protected] --output-dir sample_results/

Output Structure

Each enriched biosample contains:

Asserted data: Original NMDC biosample record
Inferred data: Discovered linked information
- Soil analysis with ENVO ontology terms
- Land cover classification across multiple systems
- Weather data from collection date
- Publication metadata and full-text when available
- Geospatial features within configurable radius
- Coordinate validation and distance calculations

Data Sources

NMDC API: Biosample and study metadata
Land Use MCP: Land cover classification systems
Weather MCP: Historical meteorological data
OLS MCP: Ontology term resolution
ARTL MCP: Publication and full-text retrieval
OpenStreetMap: Environmental feature mapping
Elevation APIs: Topographic data validation

Development

git clone https://github.com/contextualizer-ai/crawl-first.git
cd crawl-first

# Install dependencies
uv sync --dev

# Run quality checks and tests
make all

# Or run individual commands:
uv run pytest
uv run black .
uv run ruff check .
uv run mypy .
uv run deptry .

Note: Full development workflow including data fetching and testing requires the system dependencies listed above.

MCP Configuration for Claude Integration

The repository includes Makefile targets that integrate with Claude Code for testing and automation. These targets require a properly configured .mcp.json file in your Claude configuration directory.

Note: Makefile targets like claude-weather-test.txt and random-ids-test.txt will not work without proper MCP server configuration in Claude.

Name		Name	Last commit message	Last commit date
Latest commit History 109 Commits
.github/workflows		.github/workflows
archives		archives
data/outputs/validation-streaming-custom		data/outputs/validation-streaming-custom
src/crawl_first		src/crawl_first
tests		tests
.gitignore		.gitignore
.mcp.json		.mcp.json
LICENSE		LICENSE
Makefile		Makefile
QUICKSTART.md		QUICKSTART.md
README.md		README.md
SETUP.md		SETUP.md
nmdc_biosample_enrichment_enriched_plan_v_2.md		nmdc_biosample_enrichment_enriched_plan_v_2.md
philosophy.md		philosophy.md
process_random_ids.sh		process_random_ids.sh
pyproject.toml		pyproject.toml
uv.lock		uv.lock
validation_agent_improvements.md		validation_agent_improvements.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

crawl-first

Overview

Philosophy

Features

System Requirements

Installation

Usage

Single biosample

Multiple biosamples

Sample from large dataset

Output Structure

Data Sources

Development

MCP Configuration for Claude Integration

About

Uh oh!

Releases

Packages

Contributors 3

Uh oh!

Languages

License

contextualizer-ai/crawl-first

Folders and files

Latest commit

History

Repository files navigation

crawl-first

Overview

Philosophy

Features

System Requirements

Installation

Usage

Single biosample

Multiple biosamples

Sample from large dataset

Output Structure

Data Sources

Development

MCP Configuration for Claude Integration

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages