A Python library providing curated lists of stop words across 34+ languages. Stop words are common words (like "the", "is", "at") that are typically filtered out in natural language processing and text analysis tasks.
Key Features:
- 34+ Languages - Extensive language support.
- Performance - Built-in caching for fast repeated access.
- Flexible - Custom filtering system for advanced use cases.
- Zero Dependencies - Lightweight with no external requirements.
The available languages are those supported by https://github.com/Alir3z4/stop-words
Each language is identified by both its ISO 639-1 language code (e.g., en) and full name (e.g., english).
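Both identifiers are interchangeable when calling get_stop_words(); as a quick sketch (assuming the package is installed), the two forms should return the same list:
from stop_words import get_stop_words
# 'en' resolves to 'english' via LANGUAGE_MAPPING, so both calls load the same word list
assert get_stop_words('en') == get_stop_words('english')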
Via pip (Recommended):
$ pip install stop-words
Via Git:
$ git clone --recursive https://github.com/Alir3z4/python-stop-words.git
$ cd python-stop-words
$ pip install -e .
Requirements:
- Python 3: any actively supported (non-EOL) release with type hint support.
from stop_words import get_stop_words
# Get English stop words using language code
stop_words = get_stop_words('en')
# Or use the full language name
stop_words = get_stop_words('english')
# Use in text processing
text = "The quick brown fox jumps over the lazy dog"
words = text.lower().split()
filtered_words = [word for word in words if word not in stop_words]
print(filtered_words)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
Use safe_get_stop_words() when you're not sure if a language is supported:
from stop_words import safe_get_stop_words
# Returns empty list instead of raising an exception
stop_words = safe_get_stop_words('klingon') # Returns []
# Works normally with supported languages
stop_words = safe_get_stop_words('fr')  # Returns French stop words
from stop_words import AVAILABLE_LANGUAGES, LANGUAGE_MAPPING
# List all available languages
print(AVAILABLE_LANGUAGES)
# ['arabic', 'bulgarian', 'catalan', ...]
# View language code mappings
print(LANGUAGE_MAPPING)
# {'en': 'english', 'fr': 'french', ...}
By default, stop words are cached for performance. You can control this behavior:
from stop_words import get_stop_words, STOP_WORDS_CACHE
# Disable caching for this call
stop_words = get_stop_words('en', cache=False)
# Clear the cache manually
STOP_WORDS_CACHE.clear()
# Check what's cached
print(STOP_WORDS_CACHE.keys())  # dict_keys(['english', 'french', ...])
Apply custom transformations to stop words using the filter system:
from stop_words import get_stop_words, add_filter, remove_filter
# Add a global filter (applies to all languages)
def remove_short_words(words, language):
    """Remove words shorter than 3 characters."""
    return [w for w in words if len(w) >= 3]
add_filter(remove_short_words)
stop_words = get_stop_words('en', cache=False)
# Add a language-specific filter
def uppercase_words(words):
    """Convert all words to uppercase."""
    return [w.upper() for w in words]
add_filter(uppercase_words, language='english')
stop_words = get_stop_words('en', cache=False)
# Remove a filter when done
remove_filter(uppercase_words, language='english')
Note: Filters only apply to newly loaded stop words, not cached ones. Use cache=False or clear the cache to apply new filters.
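The cache-clearing route looks like this; a minimal sketch, where strip_apostrophes is a hypothetical filter used only for illustration:
from stop_words import get_stop_words, add_filter, STOP_WORDS_CACHE

def strip_apostrophes(words, language):
    """Hypothetical global filter: drop stop words containing an apostrophe."""
    return [w for w in words if "'" not in w]

get_stop_words('en')               # loads and caches the unfiltered list
add_filter(strip_apostrophes)      # register a new global filter
STOP_WORDS_CACHE.clear()           # drop the cached, unfiltered entries
stop_words = get_stop_words('en')  # reloads with the filter applied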
from stop_words import get_stop_words
import re
def preprocess_text(text, language='en'):
    """Clean and filter text for NLP tasks."""
    stop_words = set(get_stop_words(language))
    # Convert to lowercase and extract words
    words = re.findall(r'\b\w+\b', text.lower())
    # Remove stop words
    filtered_words = [w for w in words if w not in stop_words]
    return filtered_words
text = "The quick brown fox jumps over the lazy dog"
print(preprocess_text(text))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
from stop_words import get_stop_words
def filter_multilingual_text(texts_dict):
    """Process texts in multiple languages.

    Args:
        texts_dict: Dictionary mapping language codes to text strings

    Returns:
        Dictionary with filtered words for each language
    """
    results = {}
    for lang_code, text in texts_dict.items():
        stop_words = set(get_stop_words(lang_code))
        words = text.lower().split()
        filtered = [w for w in words if w not in stop_words]
        results[lang_code] = filtered
    return results
texts = {
    'en': 'The cat is on the table',
    'fr': 'Le chat est sur la table',
    'es': 'El gato está en la mesa'
}
print(filter_multilingual_text(texts))
from stop_words import get_stop_words
from collections import Counter
import re
def extract_keywords(text, language='en', top_n=10):
    """Extract the most common meaningful words from text."""
    stop_words = set(get_stop_words(language))
    # Extract words and filter
    words = re.findall(r'\b\w+\b', text.lower())
    meaningful_words = [w for w in words if w not in stop_words and len(w) > 2]
    # Count and return top keywords
    word_counts = Counter(meaningful_words)
    return word_counts.most_common(top_n)
article = """
Python is a high-level programming language. Python is known for its
simplicity and readability. Many developers choose Python for data science.
"""
keywords = extract_keywords(article)
print(keywords)
# [('python', 3), ('high', 1), ('level', 1), ...]
Load stop words for a specified language.
Parameters:
language (str): Language code (e.g., 'en') or full name (e.g., 'english')
cache (bool, optional): Enable caching. Defaults to True.
Returns:
list[str]: List of stop words
Raises:
StopWordError: If language is unavailable or files are unreadable
Example:
stop_words = get_stop_words('en')
stop_words = get_stop_words('french', cache=False)
Safely load stop words, returning an empty list on error.
Parameters:
language(str): Language code or full name
Returns:
list[str]: Stop words, or empty list if unavailable
Example:
stop_words = safe_get_stop_words('unknown')  # Returns []
Register a filter function for stop word post-processing.
Parameters:
func (Callable): Filter function
language (str | None, optional): Language code, or None for a global filter
Filter Signatures:
- Language-specific: func(stopwords: list[str]) -> list[str]
- Global: func(stopwords: list[str], language: str) -> list[str]
Example:
def remove_short(words, lang):
    return [w for w in words if len(w) > 3]
add_filter(remove_short)  # Global filter
Remove a previously registered filter.
Parameters:
func (Callable): The filter function to remove
language (str | None, optional): Language code or None
Returns:
bool: True if removed, False if not found
Example:
success = remove_filter(my_filter, language='english')
List of all supported language names.
['arabic', 'bulgarian', 'catalan', ...]
Dictionary mapping language codes to full names.
{'en': 'english', 'fr': 'french', 'de': 'german', ...}
Dictionary storing cached stop words. Can be manually cleared.
STOP_WORDS_CACHE.clear()  # Clear all cached data
Raised when a language is unavailable or files cannot be read.
from stop_words import get_stop_words, StopWordError

try:
    stop_words = get_stop_words('invalid')
except StopWordError as e:
    print(f"Error: {e}")
Use caching - Keep cache=True (default) for repeated access to the same language
Reuse stop word sets - Convert to set() once for O(1) lookup performance:
stop_words_set = set(get_stop_words('en'))
is_stop_word = 'the' in stop_words_set  # Fast membership testing
Preload languages - Load stop words during initialization, not in tight loops (see the sketch after this list)
Use safe_get_stop_words - Avoid try/except overhead when language availability is uncertain
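A minimal sketch of the preloading pattern (the documents list below is placeholder data, not part of the library):
from stop_words import get_stop_words

documents = ["The first document", "Another one about the same topic"]  # placeholder input
stop_words = set(get_stop_words('en'))  # load and convert once, outside the loop

filtered = []
for doc in documents:
    words = doc.lower().split()
    filtered.append([w for w in words if w not in stop_words])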
"Language unavailable" error
- Check spelling and use either the language code or full name
- Verify the language is in AVAILABLE_LANGUAGES
- See the Available Languages table above
"File is unreadable" error
- Ensure the package installed correctly: pip install --force-reinstall stop-words
- Check file permissions in the installation directory
- Verify the stop-words subdirectory exists in the package
Filters not applying
- Filters only affect newly loaded stop words
- Clear the cache: STOP_WORDS_CACHE.clear()
- Use cache=False when testing filters
Performance issues
- Ensure caching is enabled (default behavior)
- Convert stop word lists to sets for faster lookups
- Preload stop words outside of loops
Contributions are welcome! Here's how you can help:
- Add new languages - Submit stop word lists for unsupported languages via https://github.com/Alir3z4/stop-words
- Improve existing lists - Suggest additions or removals for existing languages via https://github.com/Alir3z4/stop-words
- Report bugs - Open issues on GitHub
- Submit PRs - Fix bugs or add features
Repository: https://github.com/Alir3z4/python-stop-words
This project is licensed under the BSD 3-Clause License. See LICENSE file for details.
See ChangeLog.rst for version history.
- Issues: https://github.com/Alir3z4/python-stop-words/issues
- PyPI: https://pypi.org/project/stop-words/
- Maintained by Alireza Savand
- Stop word lists compiled from various open sources
- Contributors: See GitHub contributors
- Stop Words - List of common stop words in various languages.
- NLTK - Natural Language Toolkit with extensive NLP features
- spaCy - Industrial-strength NLP library
- TextBlob - Simplified text processing