General Conference Talk Scraper

This project is a web scraper that extracts and cleans data from the General Conference talks website of The Church of Jesus Christ of Latter-day Saints. The script navigates the site's individual conference sessions and collects the title, speaker, calling, and content of each talk, along with any associated footnotes. The data is then cleaned for consistency and saved in structured CSV and JSON files for easy analysis.

Key Features

Scraping: Gathers conference talks dating back several decades.
Data Collection: Extracts talk titles, speakers, callings, content, and footnotes.
Data Cleaning: Standardizes the data by:
- Removing non-talk rows (e.g., session headings, "Church Auditing Department" rows).
- Standardizing callings (e.g., "Quorum of the 12" and "Seventy").
- Removing speaker titles such as "Elder," "President," etc.
Output Formats: Saves the data in both CSV and JSON formats, ready for analysis.
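
As a rough illustration of the two output formats, the sketch below writes the same records to a CSV file and a JSON file using only the Python standard library. The column names in FIELDS and the save_talks helper are assumptions made for this example; the notebook's actual field names and saving code may differ.

```python
import csv
import json
from pathlib import Path

# Hypothetical column names; the notebook's actual fields may differ.
FIELDS = ["title", "speaker", "calling", "content", "footnotes"]


def save_talks(talks, csv_path, json_path):
    """Write a list of talk dicts to a CSV file and a JSON file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" skips any keys that are not listed in FIELDS.
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(talks)
    # ensure_ascii=False keeps non-ASCII characters readable in the file.
    Path(json_path).write_text(
        json.dumps(talks, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

Calling save_talks(talks, "cleaned_conference_talks.csv", "conference_talks.json") would produce the two files named in this README.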

How to Use

Clone this repository and open the script in a Python environment, such as Google Colab.
Run the script to scrape, clean, and store the conference talks.
The cleaned data will be saved as cleaned_conference_talks.csv and conference_talks.json.
For long-running tasks, consider saving output files directly to Google Drive to avoid data loss due to session resets.
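
One possible way to follow the Google Drive suggestion is sketched below. It assumes the script runs in Colab, where google.colab.drive.mount is available; the copy_outputs helper and the destination folder name are hypothetical and not part of the original notebook.

```python
import shutil
from pathlib import Path


def copy_outputs(filenames, dest_dir):
    """Copy each existing output file into dest_dir and return the new paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in filenames:
        src = Path(name)
        if src.exists():  # skip outputs that have not been written yet
            copied.append(Path(shutil.copy2(src, dest / src.name)))
    return copied


if __name__ == "__main__":
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        drive = None
    if drive is not None:
        drive.mount("/content/drive")
        copy_outputs(
            ["cleaned_conference_talks.csv", "conference_talks.json"],
            "/content/drive/MyDrive/conference_talks",  # hypothetical folder
        )
```

Running the copy step after scraping finishes, and again after cleaning, keeps a copy of each file outside the Colab session.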

Running in Google Colab

This script can be easily run in Google Colab.
The scraping and cleaning process takes about 10 minutes in Colab.

Data Cleaning

After scraping, the data undergoes several cleaning operations:

Removal of session heading rows (e.g., morning/afternoon/evening sessions).
Removal of rows related to "Church Auditing Department" or non-talk content.
Standardization of callings like "Quorum of the 12" and "Seventy."
Removal of speaker titles such as "Elder," "President," "Sister," etc.
The cleaned data is saved in cleaned_conference_talks.csv.
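
The cleaning steps above can be sketched roughly as follows. This is a minimal standard-library illustration, not the notebook's actual code: the row keys (title, speaker, calling), the regular expressions, and the clean_talks helper are all assumptions chosen to mirror the rules listed in this section.

```python
import re

# Assumed patterns for illustration; the notebook's actual rules may differ.
TITLE_RE = re.compile(r"^(?:Elder|President|Sister)\s+")
SESSION_RE = re.compile(r"\bsession\s*$", re.IGNORECASE)
CALLING_PATTERNS = [
    (re.compile(r"quorum of the (?:twelve|12)", re.IGNORECASE), "Quorum of the 12"),
    (re.compile(r"seventy", re.IGNORECASE), "Seventy"),
]


def clean_talks(rows):
    """Return cleaned copies of talk rows, dropping non-talk rows."""
    cleaned = []
    for row in rows:
        title = row.get("title", "")
        speaker = row.get("speaker", "")
        # Drop session headings and auditing-report rows.
        if SESSION_RE.search(title):
            continue
        if "Church Auditing Department" in f"{title} {speaker}":
            continue
        new_row = dict(row)  # copy so the input rows are left unchanged
        new_row["speaker"] = TITLE_RE.sub("", speaker).strip()
        calling = row.get("calling", "")
        for pattern, label in CALLING_PATTERNS:
            if pattern.search(calling):
                new_row["calling"] = label
                break
        cleaned.append(new_row)
    return cleaned
```

Callings that match none of the patterns are kept as scraped, so unusual callings are not silently rewritten.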

Acknowledgments

This code is based on the original work found in the LDS Conference Scraper GitHub repository by johnmwood.
