
Conversation

@zhiboniu (Collaborator)

add clip and coca

```diff
 image_processor = CLIPImageProcessor.from_pretrained(os.path.join(model_args.model, "processor", "train"))
 text_processor = CLIPTextProcessor.from_pretrained(os.path.join(model_args.model, "processor", "train"))
-tokenizer = AutoTokenizer.from_pretrained(os.path.join(model_args.model, "processor"))
+tokenizer = SimpleTokenizer()
```
Collaborator:

Could AutoTokenizer be reused here, with clip, evaclip, and coca mapped to the same field?

Collaborator (Author):

AutoTokenizer's output has some diffs that are hard to fix from the outer layer, so in the end I switched to SimpleTokenizer, which matches openclip's behavior.
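
For context, a minimal sketch (not from this PR) of how such tokenizer diffs can be surfaced, assuming open_clip and paddlenlp are installed; the checkpoint name is a hypothetical stand-in, not the path used in this PR:

```python
# Hypothetical check, not part of this PR: compare open_clip's reference
# tokenization against an AutoTokenizer loaded from a CLIP checkpoint.
import open_clip
from paddlenlp.transformers import AutoTokenizer

texts = ["a photo of a cat", "two dogs playing in the snow"]

# open_clip reference behavior: BPE with <start_of_text>/<end_of_text>
# markers, zero-padded to a 77-token context window.
reference = open_clip.tokenize(texts).tolist()

# AutoTokenizer under a CLIP checkpoint ("openai/clip-vit-base-patch32"
# is a stand-in name for illustration).
auto = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
candidate = auto(texts, padding="max_length", max_length=77)["input_ids"]

for text, ref, cand in zip(texts, reference, candidate):
    if ref != cand:
        print(f"diff for {text!r}: {ref[:8]} vs {cand[:8]}")
```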

@zhiboniu merged commit 71b0886 into PaddlePaddle:develop on Aug 31, 2023.
westfish pushed a commit to westfish/PaddleMIX that referenced this pull request on Sep 25, 2024.
lyuwenyu added a commit that referenced this pull request on Feb 20, 2025:
## Operator Directory

- [1. Conversion Operators](#1-转换算子)
  - [1.1 LLaVA Conversion Operators](#11-llava转换算子)
    - [1.1.1 llava_convert](#111-llava_convert)
- [2. Filtering Operators](#2-过滤算子)
  - [2.1 Basic Filtering Operators](#21-基础过滤算子)
    - [2.1.1 valid_data_filter](#211-valid_data_filter)
      - [2.1.1.1 image_compliance_operator](#2111-image_compliance_operator)
      - [2.1.1.2 conversation_compliance_operator](#2112-conversation_compliance_operator)
  - [2.2 Text Filtering Operators](#22-文本过滤算子)
    - [2.2.1 conversation_length_filter](#221-conversation_length_filter)
    - [2.2.2 average_line_length_filter](#222-average_line_length_filter)
    - [2.2.3 maximum_line_length_filter](#223-maximum_line_length_filter)
    - [2.2.4 conversation_percentage_filter](#224-conversation_percentage_filter)
    - [2.2.5 token_num_filter](#225-token_num_filter)
    - [2.2.6 alphanumeric_ratio_filter](#226-alphanumeric_ratio_filter)
    - [2.2.7 stopwords_ratio_filter](#227-stopwords_ratio_filter)
    - [2.2.8 special_characters_filter](#228-special_characters_filter)
    - [2.2.9 language_id_filter](#229-language_id_filter)
    - [2.2.10 text_action_filter](#2210-text_action_filter)
    - [2.2.11 text_entity_dependency_filter](#2211-text_entity_dependency_filter)
    - [2.2.12 char_ngram_repetition_filter](#2212-char_ngram_repetition_filter)
    - [2.2.13 word_ngram_repetition_filter](#2213-word_ngram_repetition_filter)
    - [2.2.14 conversation_hash_filter](#2214-conversation_hash_filter)
      - [2.2.14.1 simhash_duplicate_operator](#22141-simhash_duplicate_operator)
      - [2.2.14.2 minhash_duplicate_operator](#22142-minhash_duplicate_operator)
    - [2.2.15 llm_judge_filter](#2215-llm_judge_filter)
  - [2.3 Image Filtering Operators](#23-图像过滤算子)
    - [2.3.1 image_filesize_filter](#231-image_filesize_filter)
    - [2.3.2 image_ration_filter](#232-image_ration_filter)
    - [2.3.3 image_resolution_filter](#233-image_resolution_filter)
    - [2.3.4 image_hash_filter](#234-image_hash_filter)
  - [2.4 Image-Text Filtering Operators](#24-图文过滤算子)
    - [2.4.1 image_clip_filter](#241-image_clip_filter)
- [3. Analysis Operators](#3-分析算子)
  - [3.1 Basic Analysis Operators](#31-基础分析算子)
    - [3.1.1 base_analysis_pipeline](#311-base_analysis_pipeline)
      - [3.1.1.1 analyze_dataset_statistics](#3111-analyze_dataset_statistics)
      - [3.1.1.2 analyze_language_distribution](#3112-analyze_language_distribution)
      - [3.1.1.3 analyze_image_paths](#3113-analyze_image_paths)
      - [3.1.1.4 analyze_data_anomalies](#3114-analyze_data_anomalies)
      - [3.1.1.5 analyze_conversation_tokens](#3115-analyze_conversation_tokens)
  - [3.2 Advanced Analysis Operators](#32-进阶分析算子)
    - [3.2.1 description_analysis](#321-description_analysis)
    - [3.2.2 quality_analysis](#322-quality_analysis)
- [4. Visualization Operators](#4-可视化算子)
  - [4.1 LDA Visualization Operators](#41-lda可视化算子)
    - [4.1.1 lda_topic_clustering](#411-lda_topic_clustering)
- [5. Generation Operators](#5-生成算子)
  - [5.1 Multimodal Generation Operators](#51-多模态生成算子)
    - [5.1.1 generate_qna_for_images](#511-generate_qna_for_images)
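
Most of the filter operators listed above reduce to a per-sample keep/drop predicate. A minimal sketch of the kind of check 2.2.6 alphanumeric_ratio_filter performs (function signature and thresholds here are illustrative, not the PaddleMIX API):

```python
# Illustrative only: keep a sample when the share of alphanumeric
# characters in its text falls inside a configured range.
def alphanumeric_ratio_filter(text: str, min_ratio: float = 0.25, max_ratio: float = 1.0) -> bool:
    if not text:
        return False
    ratio = sum(ch.isalnum() for ch in text) / len(text)
    return min_ratio <= ratio <= max_ratio

assert alphanumeric_ratio_filter("a photo of a cat")   # ratio 0.75 -> keep
assert not alphanumeric_ratio_filter("!!! ... ???")    # ratio 0.0  -> drop
```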



--- 
- #1055