
[AI] Create a Custom Dataset to Evaluate and Fine-Tune BERT #15


Objective

Develop a custom dataset to test and evaluate our BERT topic modeling model. The current dataset (the default newsletter data) makes it difficult to assess the model's quality. A controlled dataset with predefined topics and structures will let us evaluate the model's performance more reliably and identify areas for improvement.

Description

  • Data Sources:

    • Option 1: Export messages from the Element messaging client.
    • Option 2: Generate synthetic messages using AI tools, following the format of the Element messaging client (see the example record after this list).
  • Dataset Requirements:

    • Include messages across 5–10 predefined topics.
    • Introduce overlaps between topics to mimic real-world data complexity.
    • Ensure a variety of messages per topic, using different keywords and lexical fields.
    • The dataset should be sizable enough (e.g., 500–1000 messages) for meaningful evaluation.
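
For concreteness, here is a minimal sketch of what one labeled record could look like. The event fields (type, sender, origin_server_ts, content) follow the Matrix event shape that an Element room export typically contains; the topics field is our own ground-truth label for evaluation, not part of the Matrix schema, and the sender and message text are made up:

```python
import json

# One labeled message record. "topics" is our added evaluation label;
# the remaining fields mirror the Matrix event shape Element exports.
record = {
    "type": "m.room.message",
    "sender": "@alice:example.org",
    "origin_server_ts": 1700000000000,  # milliseconds since epoch
    "content": {
        "msgtype": "m.text",
        "body": "Has anyone compared budgeting apps? The fees alone are worth a look.",
    },
    "topics": ["finance", "technology"],  # two labels = an overlap message
}

print(json.dumps(record, indent=2))
```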

Steps to Follow

  1. Define Predefined Topics:

    • Identify specific topics relevant to our domain (e.g., technology, health, finance, wellness, entertainment, travel).
    • For each topic, list associated keywords, phrases, and lexical fields (see the generation sketch after these steps).
  2. Data Generation:

    • Option 1: Export from Element Messaging Client

      • Collect messages that correspond to the predefined topics.
      • Use the test room created on the Bored Labs Server in Element.
      • Format messages consistently (e.g., JSON or CSV).
    • Option 2: Generate Synthetic Messages Using AI

      • Use AI tools to create messages for each topic.
      • Craft prompts that guide the AI to produce messages with desired content and style.
      • Ensure messages are diverse in vocabulary and structure.
      • Ensure the format matches what an export from Element would produce.
  3. Incorporate Topic Overlaps:

    • Design messages that intentionally include keywords from multiple topics.
    • Create scenarios where topics naturally intersect.
  4. Ensure Message Variety:

    • Vary message lengths (short, medium, long).
    • Include different writing styles and tones.
    • Use synonyms and related terms to enrich lexical diversity.
    • Test what happens if abbreviations or new words (e.g., the name of a new project) are introduced.
  5. Organize and Format the Dataset:

    • Label each message with its corresponding topic(s) for validation purposes.
    • Store messages in a format compatible with our BERT model (e.g., plain text files, CSV); see the storage sketch after these steps.
  6. Quality Assurance:

    • Review the dataset to verify topic representation and message quality (human review).
    • Check for balance in the number of messages per topic.
    • Ensure that overlaps are correctly implemented.
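
To make Steps 1–3 concrete, here is a rough Python sketch of how the predefined topics and a generation prompt could be organized. The topic names, keywords, and prompt wording are all illustrative placeholders; the call to an actual AI tool is deliberately left out, since any client (API or chat UI) will do:

```python
# Sketch for Steps 1-3: predefined topics with keywords, plus a prompt
# template for synthetic message generation. Topics and keywords below
# are placeholders to be replaced with our agreed-on lists.
TOPICS: dict[str, list[str]] = {
    "technology": ["app", "server", "update", "bug", "release"],
    "health": ["sleep", "doctor", "symptoms", "recovery", "diet"],
    "finance": ["budget", "fees", "invoice", "savings", "invest"],
    "travel": ["flight", "hotel", "itinerary", "visa", "luggage"],
    "entertainment": ["movie", "series", "concert", "playlist", "trailer"],
}

def build_prompt(topics: list[str], length: str = "short") -> str:
    """Build a generation prompt for one synthetic message.

    Passing more than one topic produces an intentional overlap
    (Step 3); `length` varies message size (Step 4).
    """
    keywords = [kw for t in topics for kw in TOPICS[t]]
    return (
        f"Write one {length} chat message, in the casual style of an "
        f"Element/Matrix room, about: {', '.join(topics)}. "
        f"Naturally weave in some of these terms: {', '.join(keywords)}. "
        "Return only the message text."
    )

# Feed this prompt to whichever AI tool we pick; the client call itself
# is intentionally omitted here.
print(build_prompt(["finance", "technology"], length="medium"))
```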
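And a short sketch of the storage step (Step 5): one CSV row per message, with multi-topic labels pipe-joined so overlaps survive the flat format. The file name and column names are assumptions to be settled when we fix the final schema:

```python
import csv

# Sketch for Step 5: write labeled messages to a flat CSV. Multi-topic
# labels are pipe-joined so overlapping messages keep all their labels
# in one column. "body" and "topics" are placeholder column names.
messages = [
    {"body": "Server update broke the budgeting export again.", "topics": ["technology", "finance"]},
    {"body": "Booked the hotel, sending the itinerary tonight.", "topics": ["travel"]},
]

with open("dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["body", "topics"])
    writer.writeheader()
    for msg in messages:
        writer.writerow({"body": msg["body"], "topics": "|".join(msg["topics"])})
```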

Validation Criteria for this task

  • Topic Coverage: Each predefined topic has an adequate number of messages (e.g., at least 50 messages per topic).
  • Overlaps Implemented: A subset of messages (e.g., 10–20%) should contain overlaps between topics.
  • Variety and Diversity: Messages exhibit a range of lengths, styles, and vocabulary.
  • Correct Labeling: All messages are accurately labeled with their topic(s).
  • Data Quality: Messages are coherent, relevant, and free of errors.
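
A rough sketch of how the criteria above could be checked automatically. The thresholds come from this list; the file layout assumes the pipe-joined topics column from the storage sketch in the steps above:

```python
import csv
from collections import Counter

# Automated checks for the validation criteria: per-topic coverage,
# overlap share, and labeling. Assumes the CSV layout from the storage
# sketch: a "body" column and a pipe-joined "topics" column.
MIN_PER_TOPIC = 50             # topic coverage threshold
OVERLAP_RANGE = (0.10, 0.20)   # target share of multi-topic messages

with open("dataset.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

per_topic: Counter[str] = Counter()
overlapping = 0
for row in rows:
    topics = [t for t in row["topics"].split("|") if t]
    assert topics, f"unlabeled message: {row['body']!r}"  # correct labeling
    per_topic.update(topics)
    if len(topics) > 1:
        overlapping += 1

for topic, count in sorted(per_topic.items()):
    status = "OK" if count >= MIN_PER_TOPIC else "LOW"
    print(f"{topic}: {count} messages [{status}]")

share = overlapping / len(rows)
in_range = OVERLAP_RANGE[0] <= share <= OVERLAP_RANGE[1]
print(f"overlap share: {share:.1%} [{'OK' if in_range else 'OUT OF RANGE'}]")
```

Human review (Step 6) still covers coherence and relevance; this script only catches the mechanical failures.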

Expected Deliverables

  • A structured dataset containing all messages, ready for model input.
  • Documentation outlining the dataset creation process, including:
    • Topics selected and associated keywords.
    • Methodology for data collection/generation.
    • Any scripts, prompts, or tools used in the process, so we don't have to start from scratch if we need other datasets to compare against.
