Encyclopedia Project

A comprehensive toolset for extracting and analyzing keywords from scientific documents, with a focus on climate change research and IPCC reports.

Example and info

Project Overview

This project consists of two main subprojects that work together to process scientific documents and extract meaningful insights:

Keyword_extraction - AI-powered keyword extraction from text documents
Dictionary - Structured storage and analysis of extracted keywords and document content

Subprojects

Keyword_extraction

A Python-based tool that uses state-of-the-art Natural Language Processing (NLP) models to extract the most important keywords and keyphrases from scientific text documents. Built with Hugging Face transformers and optimized for academic content.

Key Features:

AI-powered keyword extraction using pre-trained models
Support for multiple text processing methods (sentence-based, chunk-based)
Batch processing for large documents
CSV output format for easy analysis
Configurable top-N keyword extraction

Use Cases:

Academic paper analysis
Research document summarization
Content indexing and search
Literature review automation

Dictionary

A structured storage system for organizing extracted keywords, document content, and metadata. Currently contains processed IPCC Working Group 1 reports with extracted keywords and full text content.

Key Features:

Organized storage of document chapters
Keyword frequency analysis
HTML and plain text document versions
CSV exports for data analysis
Structured directory organization

Current Content:

IPCC WG1 Chapter 1: Introduction
IPCC WG1 Chapter 5: Global Carbon and Other Biogeochemical Cycles
IPCC WG1 Chapter 6: Short-lived Climate Forcers

Quick Start

Prerequisites

Python 3.8 or higher
pip package manager
Virtual environment (recommended)

Installation

# Clone the repository
git clone <repository-url>
cd encyclopedia

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies for keyword extraction
cd Keyword_extraction
pip install -r requirements.txt

Basic Usage

Extract Keywords from a Document

cd Keyword_extraction
python Keyword_extraction.py -i your_document.txt -s results/ -o keywords.csv -n 500

Parameters:

-i: Input text file path
-s: Output directory for results
-o: Output CSV filename
-n: Number of top keywords to extract

View Extracted Keywords

# Navigate to Dictionary directory to view processed content
cd Dictionary/ipcc_wg1/ipcc_wg1_ch1
# View top keywords
cat top_keywords_only.txt
# Or open CSV file in Excel/Google Sheets
open top_keywords.csv

Project Structure

encyclopedia/
├── README.md                    # This file
├── Keyword_extraction/          # Keyword extraction tool
│   ├── README.md               # Subproject documentation
│   ├── Keyword_extraction.py   # Main extraction script
│   ├── requirements.txt        # Python dependencies
│   └── workflow.md            # Usage workflow
├── Dictionary/                  # Document storage and analysis
│   ├── README.md               # Subproject documentation
│   └── ipcc_wg1/              # IPCC Working Group 1 content
│       ├── ipcc_wg1_ch1/      # Chapter 1 content and keywords
│       ├── ipcc_wg1_ch5/      # Chapter 5 content and keywords
│       └── ipcc_wg1_ch6/      # Chapter 6 content and keywords
└── LICENSE                     # Project license

Technology Stack

Python: Core programming language
Transformers: Hugging Face NLP models for keyword extraction
BeautifulSoup: HTML parsing and processing
Pandas: Data manipulation and CSV export
PyTorch: Deep learning backend for NLP models

Contributing

This project follows established style guidelines:

Use absolute imports with module prefixes
Keep __init__.py files empty unless explicitly agreed
Follow established naming conventions (alphanumeric + underscores only)
Always propose changes before implementation
Work in small, testable steps

License

See LICENSE file for details.

Support

For questions or issues:

Check the subproject-specific README files
Review the workflow documentation in Keyword_extraction/workflow.md
Examine existing examples in the Dictionary directory

Development Notes

All output files are stored in designated directories to maintain project structure
The project follows climate change research examples for demonstrations
Keywords are extracted using the ml6team/keyphrase-extraction-kbir-inspec model
Document processing supports both HTML and plain text formats

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
.github/workflows		.github/workflows
Dictionary		Dictionary
Keyword_extraction		Keyword_extraction
demostration		demostration
docs		docs
txt2phrases		txt2phrases
LICENSE		LICENSE
README.md		README.md
TRANSCRIPT_PROCESSING_README.md		TRANSCRIPT_PROCESSING_README.md
paper.md		paper.md
publish.yml		publish.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Encyclopedia Project

Example and info

Project Overview

Subprojects

Keyword_extraction

Dictionary

Quick Start

Prerequisites

Installation

Basic Usage

Extract Keywords from a Document

View Extracted Keywords

Project Structure

Technology Stack

Contributing

License

Support

Development Notes

About

Uh oh!

Releases 3

Packages

Contributors 4

Uh oh!

Languages

License

semanticClimate/encyclopedia

Folders and files

Latest commit

History

Repository files navigation

Encyclopedia Project

Example and info

Project Overview

Subprojects

Keyword_extraction

Dictionary

Quick Start

Prerequisites

Installation

Basic Usage

Extract Keywords from a Document

View Extracted Keywords

Project Structure

Technology Stack

Contributing

License

Support

Development Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 4

Uh oh!

Languages

Packages