Objective
Develop a custom dataset to test and evaluate our BERT topic modeling model. The current dataset (default newsletter) makes it challenging to assess the model's quality. By creating a controlled dataset with predefined topics and structures, we can better evaluate the model's performance and identify areas for improvement.
Description
Data Sources:
- Option 1: Export messages from the Element messaging client.
- Option 2: Generate synthetic messages using AI tools following the format of Element messaging client.
Dataset Requirements:
- Include messages across 5–10 predefined topics.
- Introduce overlaps between topics to mimic real-world data complexity.
- Ensure a variety of messages per topic, utilizing different keywords and lexical fields.
- The dataset should be sizable enough (e.g., 500–1000 messages) for meaningful evaluation.
Steps to Follow
Define Predefined Topics:
- Identify specific topics relevant to our domain (e.g., technology, health, finance, wellness, entertainment, travel).
- For each topic, list associated keywords, phrases, and lexical fields.
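The predefined topics and their lexical fields can live in a single mapping so that generation, labeling, and validation all share one source of truth. A minimal sketch — the topic names and keyword lists below are placeholder assumptions, to be replaced with the topics actually chosen for our domain:

```python
# Hypothetical topic/keyword definitions for the custom evaluation dataset.
# Topic names and keyword lists are illustrative placeholders.
TOPICS = {
    "technology": ["software", "AI", "server", "deployment", "bug"],
    "health": ["doctor", "symptoms", "treatment", "clinic", "diagnosis"],
    "finance": ["budget", "invoice", "investment", "payroll", "revenue"],
    "wellness": ["meditation", "sleep", "stress", "exercise", "balance"],
    "travel": ["flight", "hotel", "itinerary", "visa", "booking"],
}

def keywords_for(topic: str) -> list:
    """Return the lexical field associated with a predefined topic."""
    return TOPICS[topic]
```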
Data Generation:
Option 1: Export from Element Messaging Client
- Collect messages that correspond to the predefined topics.
- Use the test room created on the Bored Labs Server in Element.
- Format messages consistently (e.g., JSON or CSV).
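Exported messages can be flattened into a consistent CSV in one pass. A sketch, assuming the export is a JSON object with a `"messages"` list of Matrix events (this matches Element Web's JSON chat export shape, but verify against an actual export before relying on it):

```python
import csv
import json

def element_export_to_csv(export_path: str, csv_path: str) -> int:
    """Convert an Element JSON chat export to a flat CSV of text messages.

    Assumes a top-level "messages" list of Matrix events; adjust the key
    names if the actual export differs. Returns the number of rows written.
    """
    with open(export_path, encoding="utf-8") as f:
        export = json.load(f)
    rows = []
    for event in export.get("messages", []):
        # Keep only plain text messages; skip state events, reactions, etc.
        if event.get("type") != "m.room.message":
            continue
        content = event.get("content", {})
        if content.get("msgtype") != "m.text":
            continue
        rows.append({
            "sender": event.get("sender", ""),
            "timestamp": event.get("origin_server_ts", ""),
            "body": content.get("body", ""),
        })
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["sender", "timestamp", "body"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```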
Option 2: Generate Synthetic Messages Using AI
- Use AI tools to create messages for each topic.
- Craft prompts that guide the AI to produce messages with desired content and style.
- Ensure messages are diverse in vocabulary and structure.
- Ensure format matches what one would get from Element.
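Prompt construction for the synthetic option can be scripted so the same recipe is reusable for future datasets. A sketch that only builds the prompt text (the wording and the requested Matrix-event output schema are illustrative assumptions; adapt both to whichever AI tool is used):

```python
import json
import random

def build_generation_prompt(topic, keywords, n=20, seed=None):
    """Build a prompt asking an AI tool for n synthetic Element-style messages.

    The prompt wording and requested JSON schema are assumptions to adapt
    to the chosen generation tool; they are not a fixed API.
    """
    rng = random.Random(seed)
    sampled = rng.sample(keywords, k=min(3, len(keywords)))
    schema = json.dumps({"sender": "@user:example.org",
                         "origin_server_ts": 0,
                         "content": {"msgtype": "m.text", "body": "..."}})
    return (
        f"Generate {n} short chat messages about {topic}. "
        f"Use varied vocabulary around: {', '.join(sampled)}. "
        "Vary message length, tone, and style. Return one JSON object per "
        f"line in this Matrix event shape: {schema}"
    )
```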
Incorporate Topic Overlaps:
- Design messages that intentionally include keywords from multiple topics.
- Create scenarios where topics naturally intersect.
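Which topic pairs get deliberate overlaps can also be chosen programmatically, so the overlap design is reproducible. A minimal sketch (the "keep about a third of all pairs" heuristic is an assumption, tuned so roughly 10–20% of messages can mix topics):

```python
import itertools
import random

def overlap_pairs(topics, seed=0):
    """Pick topic pairs whose messages should intentionally overlap.

    Shuffles all 2-topic combinations deterministically and keeps about a
    third of them (an assumed heuristic; adjust to hit the desired overlap
    share in the final dataset).
    """
    pairs = list(itertools.combinations(topics, 2))
    random.Random(seed).shuffle(pairs)
    return pairs[: max(1, len(pairs) // 3)]
```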
Ensure Message Variety:
- Vary message lengths (short, medium, long).
- Include different writing styles and tones.
- Use synonyms and related terms to enrich lexical diversity.
- Test what happens when abbreviations or previously unseen words (e.g., the name of a new project) are introduced.
Organize and Format the Dataset:
- Label each message with its corresponding topic(s) for validation purposes.
- Store messages in a format compatible with our BERT model (e.g., plain text files, CSV).
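For multi-topic messages, one simple storage convention is a single `labels` column with topics joined by `;`. A sketch, assuming each message is a dict with a `"body"` string and a `"labels"` list (field names are our choice, not a fixed format):

```python
import csv

def save_labeled_dataset(messages, path):
    """Write labeled messages to a flat CSV suitable for model input.

    Assumes each message dict has a "body" string and a "labels" list of
    one or more topics; labels are joined with ';' so a single column can
    carry multi-topic messages.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["body", "labels"])
        for msg in messages:
            writer.writerow([msg["body"], ";".join(msg["labels"])])
```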
Quality Assurance:
- Review the dataset to verify topic representation and message quality (human review).
- Check for balance in the number of messages per topic.
- Ensure that overlaps are correctly implemented.
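The human review step can be made tractable by drawing a small deterministic sample per topic instead of reading everything. A sketch under the same assumed message shape (`"labels"` list per message):

```python
import random

def review_sample(messages, per_topic=5, seed=0):
    """Draw a small random sample of messages per topic for manual review.

    A multi-label message appears in the pool of each of its topics. The
    seed makes the sample reproducible across review rounds.
    """
    rng = random.Random(seed)
    by_topic = {}
    for msg in messages:
        for label in msg["labels"]:
            by_topic.setdefault(label, []).append(msg)
    return {topic: rng.sample(pool, k=min(per_topic, len(pool)))
            for topic, pool in by_topic.items()}
```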
Validation Criteria for this task
- Topic Coverage: Each predefined topic has an adequate number of messages (e.g., at least 50 messages per topic).
- Overlaps Implemented: A subset of messages (e.g., 10–20%) should contain overlaps between topics.
- Variety and Diversity: Messages exhibit a range of lengths, styles, and vocabulary.
- Correct Labeling: All messages are accurately labeled with their topic(s).
- Data Quality: Messages are coherent, relevant, and free of errors.
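The first four criteria above are mechanically checkable. A sketch that mirrors the stated thresholds (at least 50 messages per topic, 10–20% overlapping messages) and returns human-readable failures; coherence and relevance still need the human review step:

```python
def validate_dataset(messages, min_per_topic=50, overlap_range=(0.10, 0.20)):
    """Check topic coverage and overlap share against the issue's criteria.

    Assumes each message dict has a "labels" list. Returns a list of
    failure descriptions; an empty list means the checks passed.
    """
    failures = []
    counts = {}
    for msg in messages:
        for label in msg["labels"]:
            counts[label] = counts.get(label, 0) + 1
    for topic, n in counts.items():
        if n < min_per_topic:
            failures.append(f"topic '{topic}' has only {n} messages")
    overlapping = sum(1 for m in messages if len(m["labels"]) > 1)
    ratio = overlapping / len(messages) if messages else 0.0
    lo, hi = overlap_range
    if not lo <= ratio <= hi:
        failures.append(f"overlap ratio {ratio:.0%} outside {lo:.0%}-{hi:.0%}")
    return failures
```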
Expected Deliverables
- A structured dataset containing all messages, ready for model input.
- Documentation outlining the dataset creation process, including:
- Topics selected and associated keywords.
- Methodology for data collection/generation.
- Any scripts, prompts, or tools used in the process - so we don't have to start from scratch if we need other datasets to compare.