Python Stop Words

A Python library providing curated lists of stop words across 34+ languages. Stop words are common words (like "the", "is", "at") that are typically filtered out in natural language processing and text analysis tasks.

Key Features:

  • 34+ Languages - Extensive language support.
  • Performance - Built-in caching for fast repeated access.
  • Flexible - Custom filtering system for advanced use cases.
  • Zero Dependencies - Lightweight with no external requirements.

All available languages are sourced from https://github.com/Alir3z4/stop-words, the upstream collection of stop word lists.

Each language is identified by both its ISO 639-1 language code (e.g., en) and full name (e.g., english).
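
For example, the code and the full name resolve to the same word list; a minimal sketch (assuming the package is installed with its bundled English list):

from stop_words import get_stop_words

# 'en' is mapped to 'english', so both calls return the same stop words
assert get_stop_words('en') == get_stop_words('english')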

Via pip (Recommended):

$ pip install stop-words

Via Git:

$ git clone --recursive https://github.com/Alir3z4/python-stop-words.git
$ cd python-stop-words
$ pip install -e .

Requirements:

  • Any actively supported version of Python (one that supports type hints and has not reached end of life).

Basic usage:

from stop_words import get_stop_words

# Get English stop words using language code
stop_words = get_stop_words('en')

# Or use the full language name
stop_words = get_stop_words('english')

# Use in text processing
text = "The quick brown fox jumps over the lazy dog"
words = text.lower().split()
filtered_words = [word for word in words if word not in stop_words]
print(filtered_words)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

Use safe_get_stop_words() when you're not sure if a language is supported:

from stop_words import safe_get_stop_words

# Returns empty list instead of raising an exception
stop_words = safe_get_stop_words('klingon')  # Returns []

# Works normally with supported languages
stop_words = safe_get_stop_words('fr')  # Returns French stop words

To see which languages are available and how codes map to names:

from stop_words import AVAILABLE_LANGUAGES, LANGUAGE_MAPPING

# List all available languages
print(AVAILABLE_LANGUAGES)
# ['arabic', 'bulgarian', 'catalan', ...]

# View language code mappings
print(LANGUAGE_MAPPING)
# {'en': 'english', 'fr': 'french', ...}

By default, stop words are cached for performance. You can control this behavior:

from stop_words import get_stop_words, STOP_WORDS_CACHE

# Disable caching for this call
stop_words = get_stop_words('en', cache=False)

# Clear the cache manually
STOP_WORDS_CACHE.clear()

# Check what's cached
print(list(STOP_WORDS_CACHE.keys()))  # ['english', 'french', ...]

Apply custom transformations to stop words using the filter system:

from stop_words import get_stop_words, add_filter, remove_filter

# Add a global filter (applies to all languages)
def remove_short_words(words, language):
    """Remove words shorter than 3 characters."""
    return [w for w in words if len(w) >= 3]

add_filter(remove_short_words)
stop_words = get_stop_words('en', cache=False)

# Add a language-specific filter
def uppercase_words(words):
    """Convert all words to uppercase."""
    return [w.upper() for w in words]

add_filter(uppercase_words, language='english')
stop_words = get_stop_words('en', cache=False)

# Remove a filter when done
remove_filter(uppercase_words, language='english')

Note: Filters only apply to newly loaded stop words, not cached ones. Use cache=False or clear the cache to apply new filters.
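
The sketch below illustrates that interaction; the filter and variable names are made up for this example:

from stop_words import get_stop_words, add_filter, STOP_WORDS_CACHE

# Warm the cache before any filter is registered
get_stop_words('en')

def drop_single_letters(words, language):
    """Hypothetical global filter: drop one-character stop words."""
    return [w for w in words if len(w) > 1]

add_filter(drop_single_letters)

# This call is served from the cache, so the new filter is not applied
unfiltered = get_stop_words('en')

# Either bypass the cache for this call ...
filtered = get_stop_words('en', cache=False)

# ... or clear it so subsequent cached loads pass through the filter
STOP_WORDS_CACHE.clear()
filtered = get_stop_words('en')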

A typical text preprocessing helper for NLP tasks:

from stop_words import get_stop_words
import re

def preprocess_text(text, language='en'):
    """Clean and filter text for NLP tasks."""
    stop_words = set(get_stop_words(language))

    # Convert to lowercase and extract words
    words = re.findall(r'\b\w+\b', text.lower())

    # Remove stop words
    filtered_words = [w for w in words if w not in stop_words]

    return filtered_words

text = "The quick brown fox jumps over the lazy dog"
print(preprocess_text(text))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

Processing texts in several languages at once:

from stop_words import get_stop_words

def filter_multilingual_text(texts_dict):
    """Process texts in multiple languages.

    Args:
        texts_dict: Dictionary mapping language codes to text strings

    Returns:
        Dictionary with filtered words for each language
    """
    results = {}

    for lang_code, text in texts_dict.items():
        stop_words = set(get_stop_words(lang_code))
        words = text.lower().split()
        filtered = [w for w in words if w not in stop_words]
        results[lang_code] = filtered

    return results

texts = {
    'en': 'The cat is on the table',
    'fr': 'Le chat est sur la table',
    'es': 'El gato está en la mesa'
}

print(filter_multilingual_text(texts))

Extracting the most frequent keywords from a text:

from stop_words import get_stop_words
from collections import Counter
import re

def extract_keywords(text, language='en', top_n=10):
    """Extract the most common meaningful words from text."""
    stop_words = set(get_stop_words(language))

    # Extract words and filter
    words = re.findall(r'\b\w+\b', text.lower())
    meaningful_words = [w for w in words if w not in stop_words and len(w) > 2]

    # Count and return top keywords
    word_counts = Counter(meaningful_words)
    return word_counts.most_common(top_n)

article = """
Python is a high-level programming language. Python is known for its
simplicity and readability. Many developers choose Python for data science.
"""

keywords = extract_keywords(article)
print(keywords)
# [('python', 3), ('high', 1), ('level', 1), ...]

get_stop_words(language, *, cache=True)

Load stop words for a specified language.

Parameters:

  • language (str): Language code (e.g., 'en') or full name (e.g., 'english')
  • cache (bool, optional): Enable caching. Defaults to True.

Returns:

  • list[str]: List of stop words

Raises:

  • StopWordError: If language is unavailable or files are unreadable

Example:

stop_words = get_stop_words('en')
stop_words = get_stop_words('french', cache=False)

safe_get_stop_words(language)

Safely load stop words, returning empty list on error.

Parameters:

  • language (str): Language code or full name

Returns:

  • list[str]: Stop words, or empty list if unavailable

Example:

stop_words = safe_get_stop_words('unknown')  # Returns []

add_filter(func, language=None)

Register a filter function for stop word post-processing.

Parameters:

  • func (Callable): Filter function
  • language (str | None, optional): Language code or None for global filter

Filter Signatures:

  • Language-specific: func(stopwords: list[str]) -> list[str]
  • Global: func(stopwords: list[str], language: str) -> list[str]

Example:

def remove_short(words, lang):
    return [w for w in words if len(w) > 3]

add_filter(remove_short)  # Global filter
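
A language-specific filter, by contrast, receives only the word list; a small illustrative sketch (the function name is hypothetical):

def drop_apostrophes(words):
    return [w.replace("'", "") for w in words]

add_filter(drop_apostrophes, language='english')  # Applies to English only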

remove_filter(func, language=None)

Remove a previously registered filter.

Parameters:

  • func (Callable): The filter function to remove
  • language (str | None, optional): Language code or None

Returns:

  • bool: True if removed, False if not found

Example:

success = remove_filter(my_filter, language='english')

AVAILABLE_LANGUAGES

List of all supported language names.

['arabic', 'bulgarian', 'catalan', ...]

LANGUAGE_MAPPING

Dictionary mapping language codes to full names.

{'en': 'english', 'fr': 'french', 'de': 'german', ...}

STOP_WORDS_CACHE

Dictionary storing cached stop words. Can be manually cleared.

STOP_WORDS_CACHE.clear()  # Clear all cached data

StopWordError

Raised when a language is unavailable or files cannot be read.

Example:

from stop_words import get_stop_words, StopWordError

try:
    stop_words = get_stop_words('invalid')
except StopWordError as e:
    print(f"Error: {e}")

Best practices:

  1. Use caching - Keep cache=True (default) for repeated access to the same language

  2. Reuse stop word sets - Convert to set() once for O(1) lookup performance:

    stop_words_set = set(get_stop_words('en'))
    # Fast membership testing
    is_stop_word = 'the' in stop_words_set

  3. Preload languages - Load stop words during initialization, not in tight loops

  4. Use safe_get_stop_words - Avoid try/except overhead when language availability is uncertain (see the sketch after this list)
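
Putting points 3 and 4 together, a minimal sketch of preloading stop words at startup (the language list and helper name are illustrative):

from stop_words import safe_get_stop_words

# Preload once at startup instead of inside processing loops;
# unsupported codes simply yield an empty set
LANGS = ['en', 'fr', 'de', 'xx']
STOP_WORD_SETS = {lang: set(safe_get_stop_words(lang)) for lang in LANGS}

def remove_stop_words(words, lang):
    return [w for w in words if w not in STOP_WORD_SETS.get(lang, set())]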

"Language unavailable" error

  • Check spelling and use either the language code or full name
  • Verify the language is in AVAILABLE_LANGUAGES (a quick check is sketched below)
  • See the list of supported languages above
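
A small helper for checking whether an identifier is supported before calling get_stop_words (the function name is illustrative):

from stop_words import AVAILABLE_LANGUAGES, LANGUAGE_MAPPING

def is_supported(language):
    """Accept either an ISO 639-1 code or a full language name."""
    return language in LANGUAGE_MAPPING or language in AVAILABLE_LANGUAGES

print(is_supported('en'), is_supported('english'), is_supported('klingon'))
# True True False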

"File is unreadable" error

  • Ensure the package installed correctly: pip install --force-reinstall stop-words
  • Check file permissions in the installation directory
  • Verify the stop-words subdirectory exists in the package

Filters not applying

  • Filters only affect newly loaded stop words
  • Clear the cache: STOP_WORDS_CACHE.clear()
  • Use cache=False when testing filters

Performance issues

  • Ensure caching is enabled (default behavior)
  • Convert stop word lists to sets for faster lookups
  • Preload stop words outside of loops

Contributions are welcome! Here's how you can help:

  1. Add new languages - Submit stop word lists for unsupported languages via https://github.com/Alir3z4/stop-words
  2. Improve existing lists - Suggest additions or removals for existing languages via https://github.com/Alir3z4/stop-words
  3. Report bugs - Open issues on GitHub
  4. Submit PRs - Fix bugs or add features

Repository: https://github.com/Alir3z4/python-stop-words

This project is licensed under the BSD 3-Clause License. See LICENSE file for details.

See ChangeLog.rst for version history.

Related projects:

  • Stop Words - List of common stop words in various languages
  • NLTK - Natural Language Toolkit with extensive NLP features
  • spaCy - Industrial-strength NLP library
  • TextBlob - Simplified text processing