Contributor

@arhamm1 arhamm1 commented Nov 19, 2025

Description

docs: improve quality assessment index with complete examples
Fixes incomplete code examples and adds context for utility functions.

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Contributor Author

arhamm1 commented Nov 19, 2025

Should the code sample be updated to the following?

Combined Quality Filtering

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage

# Create multi-stage quality pipeline
quality_pipeline = Pipeline(name="audio_quality_assessment")

# Calculate all metrics
quality_pipeline.add_stage(GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer"
))

quality_pipeline.add_stage(GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration"
))

# Apply filters in sequence
quality_pipeline.add_stage(PreserveByValueStage(
    input_value_key="wer",
    target_value=50.0,
    operator="le"  # WER <= 50%
))

quality_pipeline.add_stage(PreserveByValueStage(
    input_value_key="duration",
    target_value=1.0,
    operator="ge"  # Duration >= 1s
))

quality_pipeline.add_stage(PreserveByValueStage(
    input_value_key="duration",
    target_value=20.0,
    operator="le"  # Duration <= 20s
))
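
To sanity-check the thresholds without installing NeMo Curator, here is a minimal plain-Python sketch of what the three filters in the sample are meant to keep. The `preserve_by_value` helper and the manifest entries are hypothetical stand-ins written for this illustration; they are not the library's implementation, and the real `PreserveByValueStage` may behave differently in details such as missing keys.

```python
import operator

# Hypothetical manifest entries; the keys mirror the ones used in the sample above.
entries = [
    {"audio_filepath": "a.wav", "wer": 12.5, "duration": 4.2},
    {"audio_filepath": "b.wav", "wer": 63.0, "duration": 7.9},   # fails WER <= 50
    {"audio_filepath": "c.wav", "wer": 8.0, "duration": 0.4},    # fails duration >= 1
    {"audio_filepath": "d.wav", "wer": 20.0, "duration": 25.0},  # fails duration <= 20
    {"audio_filepath": "e.wav", "wer": 50.0, "duration": 20.0},  # boundary values pass
]

# Map the operator strings used in the sample to Python comparison functions.
OPS = {"le": operator.le, "ge": operator.ge}


def preserve_by_value(rows, input_value_key, target_value, op):
    """Keep rows where `row[input_value_key] <op> target_value` holds (illustrative helper)."""
    compare = OPS[op]
    return [row for row in rows if compare(row[input_value_key], target_value)]


# Apply the filters in the same order as the proposed pipeline.
kept = preserve_by_value(entries, "wer", 50.0, "le")
kept = preserve_by_value(kept, "duration", 1.0, "ge")
kept = preserve_by_value(kept, "duration", 20.0, "le")

kept_paths = [row["audio_filepath"] for row in kept]
print(kept_paths)  # ['a.wav', 'e.wav']
```

Because `le` and `ge` are inclusive, entries sitting exactly on a threshold (WER of 50.0, duration of 20.0) are preserved, which matches the `<=` and `>=` comments in the sample.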

@arhamm1 arhamm1 requested review from karpnv and lbliii November 19, 2025 01:15
@greptile-apps
Contributor

greptile-apps bot commented Nov 19, 2025

Greptile Summary

  • Updates audio quality assessment documentation with complete code examples and better explanations of utility functions versus pipeline stages
  • Fixes incomplete ASR inference example by adding proper GPU resource allocation parameters
  • Adds cross-references and improved formatting to enhance documentation clarity and usability

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/curate-audio/process-data/quality-assessment/index.md | Enhanced documentation with complete examples, fixed utility function context, and improved cross-references |

Confidence score: 5/5

  • This PR is safe to merge with minimal risk as it only improves documentation quality
  • Score reflects documentation-only changes with no impact on code functionality or behavior
  • No files require special attention since this is purely a documentation enhancement

Sequence Diagram

sequenceDiagram
    participant User
    participant Pipeline
    participant CreateInitialManifestStage
    participant InferenceAsrNemoStage
    participant GetPairwiseWerStage
    participant GetAudioDurationStage
    participant PreserveByValueStage
    participant AudioToDocumentStage
    participant JsonlWriter
    participant XennaExecutor

    User->>Pipeline: "Create audio quality assessment pipeline"
    User->>Pipeline: "Add CreateInitialManifestFleursStage"
    Pipeline->>CreateInitialManifestStage: "Load audio data from FLEURS dataset"
    CreateInitialManifestStage-->>Pipeline: "Audio manifest data"

    User->>Pipeline: "Add InferenceAsrNemoStage"
    Pipeline->>InferenceAsrNemoStage: "Perform ASR inference"
    InferenceAsrNemoStage-->>Pipeline: "Audio data with predictions"

    User->>Pipeline: "Add GetPairwiseWerStage"
    Pipeline->>GetPairwiseWerStage: "Calculate WER metrics"
    GetPairwiseWerStage-->>Pipeline: "Audio data with WER scores"

    User->>Pipeline: "Add GetAudioDurationStage"
    Pipeline->>GetAudioDurationStage: "Calculate audio duration"
    GetAudioDurationStage-->>Pipeline: "Audio data with duration"

    User->>Pipeline: "Add WER filter stage"
    Pipeline->>PreserveByValueStage: "Filter by WER <= 75%"
    PreserveByValueStage-->>Pipeline: "Filtered by WER"

    User->>Pipeline: "Add duration min filter"
    Pipeline->>PreserveByValueStage: "Filter duration >= 1s"
    PreserveByValueStage-->>Pipeline: "Filtered by min duration"

    User->>Pipeline: "Add duration max filter" 
    Pipeline->>PreserveByValueStage: "Filter duration <= 30s"
    PreserveByValueStage-->>Pipeline: "Filtered by max duration"

    User->>Pipeline: "Add AudioToDocumentStage"
    Pipeline->>AudioToDocumentStage: "Convert to document format"
    AudioToDocumentStage-->>Pipeline: "Document format data"

    User->>Pipeline: "Add JsonlWriter"
    Pipeline->>JsonlWriter: "Write high quality results"
    JsonlWriter-->>Pipeline: "Export complete"

    User->>Pipeline: "Execute with XennaExecutor"
    Pipeline->>XennaExecutor: "Run quality assessment pipeline"
    XennaExecutor-->>User: "High-quality filtered audio dataset"
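
As a rough illustration of the "Calculate WER metrics" step in the diagram, the sketch below computes a word error rate as a percentage from a reference transcript and a predicted transcript, using word-level edit distance. The `word_error_rate` function is a hypothetical helper written for this comment, assuming whitespace tokenization and no text normalization; it is not the code behind `GetPairwiseWerStage`, which may tokenize, normalize, or round differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # Classic dynamic-programming edit distance, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current

    return 100.0 * previous[-1] / len(ref_words)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))  # 25.0
```

On this scale, a filter such as `target_value=50.0` with `operator="le"` keeps utterances where at most half of the reference words would need to be edited to match the prediction.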

Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments
