
[AI] Create a Custom Dataset to Evaluate and Fine-Tune BERT #15


Objective

Develop a custom dataset to test and evaluate our BERT topic modeling model. The current dataset (the default newsletter data) makes it difficult to assess the model's quality. A controlled dataset with predefined topics and structures will let us evaluate the model's performance more reliably and identify areas for improvement.

Description

  • Data Sources:

    • Option 1: Export messages from the Element messaging client.
    • Option 2: Generate synthetic messages using AI tools, following the format of the Element messaging client (see the example record after this list).
  • Dataset Requirements:

    • Include messages across 5–10 predefined topics.
    • Introduce overlaps between topics to mimic real-world data complexity.
    • Ensure a variety of messages per topic, using different keywords and lexical fields.
    • The dataset should be sizable enough (e.g., 500–1000 messages) for meaningful evaluation.
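
For concreteness, here is a minimal sketch of what one labeled record could look like. The event fields (type, sender, origin_server_ts, content) follow the Matrix event shape that an Element room export typically contains; the topics field is our own ground-truth label for evaluation, not part of the Matrix schema, and the sender and message text are made up:

```python
import json

# One labeled message record. "topics" is our added evaluation label;
# the remaining fields mirror the Matrix event shape Element exports.
record = {
    "type": "m.room.message",
    "sender": "@alice:example.org",
    "origin_server_ts": 1700000000000,  # milliseconds since epoch
    "content": {
        "msgtype": "m.text",
        "body": "Has anyone compared budgeting apps? The fees alone are worth a look.",
    },
    "topics": ["finance", "technology"],  # two labels = an overlap message
}

print(json.dumps(record, indent=2))
```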

Steps to Follow

  1. Define Predefined Topics:

    • Identify specific topics relevant to our domain (e.g., technology, health, finance, wellness, entertainment, travel).
    • For each topic, list associated keywords, phrases, and lexical fields (see the generation sketch after these steps).
  2. Data Generation:

    • Option 1: Export from Element Messaging Client

      • Collect messages that correspond to the predefined topics.
      • Use the test room created on the Bored Labs Server in Element.
      • Format messages consistently (e.g., JSON or CSV).
    • Option 2: Generate Synthetic Messages Using AI

      • Use AI tools to create messages for each topic.
      • Craft prompts that guide the AI to produce messages with desired content and style.
      • Ensure messages are diverse in vocabulary and structure.
      • Ensure the format matches what an export from Element would produce.
  3. Incorporate Topic Overlaps:

    • Design messages that intentionally include keywords from multiple topics.
    • Create scenarios where topics naturally intersect.
  4. Ensure Message Variety:

    • Vary message lengths (short, medium, long).
    • Include different writing styles and tones.
    • Use synonyms and related terms to enrich lexical diversity.
    • Test what happens if abbreviations or new words (e.g., the name of a new project) are introduced.
  5. Organize and Format the Dataset:

    • Label each message with its corresponding topic(s) for validation purposes.
    • Store messages in a format compatible with our BERT model (e.g., plain text files, CSV); see the storage sketch after these steps.
  6. Quality Assurance:

    • Review the dataset to verify topic representation and message quality (human review).
    • Check for balance in the number of messages per topic.
    • Ensure that overlaps are correctly implemented.
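
To make Steps 1–3 concrete, here is a rough Python sketch of how the predefined topics and a generation prompt could be organized. The topic names, keywords, and prompt wording are all illustrative placeholders; the call to an actual AI tool is deliberately left out, since any client (API or chat UI) will do:

```python
# Sketch for Steps 1-3: predefined topics with keywords, plus a prompt
# template for synthetic message generation. Topics and keywords below
# are placeholders to be replaced with our agreed-on lists.
TOPICS: dict[str, list[str]] = {
    "technology": ["app", "server", "update", "bug", "release"],
    "health": ["sleep", "doctor", "symptoms", "recovery", "diet"],
    "finance": ["budget", "fees", "invoice", "savings", "invest"],
    "travel": ["flight", "hotel", "itinerary", "visa", "luggage"],
    "entertainment": ["movie", "series", "concert", "playlist", "trailer"],
}

def build_prompt(topics: list[str], length: str = "short") -> str:
    """Build a generation prompt for one synthetic message.

    Passing more than one topic produces an intentional overlap
    (Step 3); `length` varies message size (Step 4).
    """
    keywords = [kw for t in topics for kw in TOPICS[t]]
    return (
        f"Write one {length} chat message, in the casual style of an "
        f"Element/Matrix room, about: {', '.join(topics)}. "
        f"Naturally weave in some of these terms: {', '.join(keywords)}. "
        "Return only the message text."
    )

# Feed this prompt to whichever AI tool we pick; the client call itself
# is intentionally omitted here.
print(build_prompt(["finance", "technology"], length="medium"))
```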
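And a short sketch of the storage step (Step 5): one CSV row per message, with multi-topic labels pipe-joined so overlaps survive the flat format. The file name and column names are assumptions to be settled when we fix the final schema:

```python
import csv

# Sketch for Step 5: write labeled messages to a flat CSV. Multi-topic
# labels are pipe-joined so overlapping messages keep all their labels
# in one column. "body" and "topics" are placeholder column names.
messages = [
    {"body": "Server update broke the budgeting export again.", "topics": ["technology", "finance"]},
    {"body": "Booked the hotel, sending the itinerary tonight.", "topics": ["travel"]},
]

with open("dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["body", "topics"])
    writer.writeheader()
    for msg in messages:
        writer.writerow({"body": msg["body"], "topics": "|".join(msg["topics"])})
```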

Validation Criteria for this task

  • Topic Coverage: Each predefined topic has an adequate number of messages (e.g., at least 50 messages per topic).
  • Overlaps Implemented: A subset of messages (e.g., 10–20%) should contain overlaps between topics.
  • Variety and Diversity: Messages exhibit a range of lengths, styles, and vocabulary.
  • Correct Labeling: All messages are accurately labeled with their topic(s).
  • Data Quality: Messages are coherent, relevant, and free of errors.
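
A rough sketch of how the criteria above could be checked automatically. The thresholds come from this list; the file layout assumes the pipe-joined topics column from the storage sketch in the steps above:

```python
import csv
from collections import Counter

# Automated checks for the validation criteria: per-topic coverage,
# overlap share, and labeling. Assumes the CSV layout from the storage
# sketch: a "body" column and a pipe-joined "topics" column.
MIN_PER_TOPIC = 50             # topic coverage threshold
OVERLAP_RANGE = (0.10, 0.20)   # target share of multi-topic messages

with open("dataset.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

per_topic: Counter[str] = Counter()
overlapping = 0
for row in rows:
    topics = [t for t in row["topics"].split("|") if t]
    assert topics, f"unlabeled message: {row['body']!r}"  # correct labeling
    per_topic.update(topics)
    if len(topics) > 1:
        overlapping += 1

for topic, count in sorted(per_topic.items()):
    status = "OK" if count >= MIN_PER_TOPIC else "LOW"
    print(f"{topic}: {count} messages [{status}]")

share = overlapping / len(rows)
in_range = OVERLAP_RANGE[0] <= share <= OVERLAP_RANGE[1]
print(f"overlap share: {share:.1%} [{'OK' if in_range else 'OUT OF RANGE'}]")
```

Human review (Step 6) still covers coherence and relevance; this script only catches the mechanical failures.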

Expected Deliverables

  • A structured dataset containing all messages, ready for model input.
  • Documentation outlining the dataset creation process, including:
    • Topics selected and associated keywords.
    • Methodology for data collection/generation.
    • Any scripts, prompts, or tools used in the process, so we don't have to start from scratch if we need other datasets to compare against.
