
Conversation

@zhiboniu (Collaborator)

add clip and coca

```diff
 image_processor = CLIPImageProcessor.from_pretrained(os.path.join(model_args.model, "processor", "train"))
 text_processor = CLIPTextProcessor.from_pretrained(os.path.join(model_args.model, "processor", "train"))
-tokenizer = AutoTokenizer.from_pretrained(os.path.join(model_args.model, "processor"))
+tokenizer = SimpleTokenizer()
```
Collaborator:

Could AutoTokenizer be reused here, with clip, evaclip, and coca mapped to the same field?

Collaborator (Author):

AutoTokenizer's output has some diffs that are hard to fix from the outer layer, so in the end I switched to SimpleTokenizer, which matches openclip's behavior.
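
For context, a minimal sketch (not from this PR) of how such tokenizer diffs can be surfaced, assuming open_clip and paddlenlp are installed; the checkpoint name is a hypothetical stand-in, not the path used in this PR:

```python
# Hypothetical check, not part of this PR: compare open_clip's reference
# tokenization against an AutoTokenizer loaded from a CLIP checkpoint.
import open_clip
from paddlenlp.transformers import AutoTokenizer

texts = ["a photo of a cat", "two dogs playing in the snow"]

# open_clip reference behavior: BPE with <start_of_text>/<end_of_text>
# markers, zero-padded to a 77-token context window.
reference = open_clip.tokenize(texts).tolist()

# AutoTokenizer under a CLIP checkpoint ("openai/clip-vit-base-patch32"
# is a stand-in name for illustration).
auto = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
candidate = auto(texts, padding="max_length", max_length=77)["input_ids"]

for text, ref, cand in zip(texts, reference, candidate):
    if ref != cand:
        print(f"diff for {text!r}: {ref[:8]} vs {cand[:8]}")
```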

@zhiboniu merged commit 71b0886 into PaddlePaddle:develop on Aug 31, 2023.
westfish pushed a commit to westfish/PaddleMIX that referenced this pull request on Sep 25, 2024.
lyuwenyu added a commit that referenced this pull request on Feb 20, 2025:
## Operator Directory

- [1. Conversion Operators](#1-转换算子)
  - [1.1 LLaVA Conversion Operators](#11-llava转换算子)
    - [1.1.1 llava_convert](#111-llava_convert)
- [2. Filtering Operators](#2-过滤算子)
  - [2.1 Basic Filtering Operators](#21-基础过滤算子)
    - [2.1.1 valid_data_filter](#211-valid_data_filter)
      - [2.1.1.1 image_compliance_operator](#2111-image_compliance_operator)
      - [2.1.1.2 conversation_compliance_operator](#2112-conversation_compliance_operator)
  - [2.2 Text Filtering Operators](#22-文本过滤算子)
    - [2.2.1 conversation_length_filter](#221-conversation_length_filter)
    - [2.2.2 average_line_length_filter](#222-average_line_length_filter)
    - [2.2.3 maximum_line_length_filter](#223-maximum_line_length_filter)
    - [2.2.4 conversation_percentage_filter](#224-conversation_percentage_filter)
    - [2.2.5 token_num_filter](#225-token_num_filter)
    - [2.2.6 alphanumeric_ratio_filter](#226-alphanumeric_ratio_filter)
    - [2.2.7 stopwords_ratio_filter](#227-stopwords_ratio_filter)
    - [2.2.8 special_characters_filter](#228-special_characters_filter)
    - [2.2.9 language_id_filter](#229-language_id_filter)
    - [2.2.10 text_action_filter](#2210-text_action_filter)
    - [2.2.11 text_entity_dependency_filter](#2211-text_entity_dependency_filter)
    - [2.2.12 char_ngram_repetition_filter](#2212-char_ngram_repetition_filter)
    - [2.2.13 word_ngram_repetition_filter](#2213-word_ngram_repetition_filter)
    - [2.2.14 conversation_hash_filter](#2214-conversation_hash_filter)
      - [2.2.14.1 simhash_duplicate_operator](#22141-simhash_duplicate_operator)
      - [2.2.14.2 minhash_duplicate_operator](#22142-minhash_duplicate_operator)
    - [2.2.15 llm_judge_filter](#2215-llm_judge_filter)
  - [2.3 Image Filtering Operators](#23-图像过滤算子)
    - [2.3.1 image_filesize_filter](#231-image_filesize_filter)
    - [2.3.2 image_ration_filter](#232-image_ration_filter)
    - [2.3.3 image_resolution_filter](#233-image_resolution_filter)
    - [2.3.4 image_hash_filter](#234-image_hash_filter)
  - [2.4 Image-Text Filtering Operators](#24-图文过滤算子)
    - [2.4.1 image_clip_filter](#241-image_clip_filter)
- [3. Analysis Operators](#3-分析算子)
  - [3.1 Basic Analysis Operators](#31-基础分析算子)
    - [3.1.1 base_analysis_pipeline](#311-base_analysis_pipeline)
      - [3.1.1.1 analyze_dataset_statistics](#3111-analyze_dataset_statistics)
      - [3.1.1.2 analyze_language_distribution](#3112-analyze_language_distribution)
      - [3.1.1.3 analyze_image_paths](#3113-analyze_image_paths)
      - [3.1.1.4 analyze_data_anomalies](#3114-analyze_data_anomalies)
      - [3.1.1.5 analyze_conversation_tokens](#3115-analyze_conversation_tokens)
  - [3.2 Advanced Analysis Operators](#32-进阶分析算子)
    - [3.2.1 description_analysis](#321-description_analysis)
    - [3.2.2 quality_analysis](#322-quality_analysis)
- [4. Visualization Operators](#4-可视化算子)
  - [4.1 LDA Visualization Operators](#41-lda可视化算子)
    - [4.1.1 lda_topic_clustering](#411-lda_topic_clustering)
- [5. Generation Operators](#5-生成算子)
  - [5.1 Multimodal Generation Operators](#51-多模态生成算子)
    - [5.1.1 generate_qna_for_images](#511-generate_qna_for_images)
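
Most of the filter operators listed above reduce to a per-sample keep/drop predicate. A minimal sketch of the kind of check 2.2.6 alphanumeric_ratio_filter performs (function signature and thresholds here are illustrative, not the PaddleMIX API):

```python
# Illustrative only: keep a sample when the share of alphanumeric
# characters in its text falls inside a configured range.
def alphanumeric_ratio_filter(text: str, min_ratio: float = 0.25, max_ratio: float = 1.0) -> bool:
    if not text:
        return False
    ratio = sum(ch.isalnum() for ch in text) / len(text)
    return min_ratio <= ratio <= max_ratio

assert alphanumeric_ratio_filter("a photo of a cat")   # ratio 0.75 -> keep
assert not alphanumeric_ratio_filter("!!! ... ???")    # ratio 0.0  -> drop
```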



--- 
- #1055