
scripts to reproduce the results in the paper #25

@jinz2014

Description


Hi,

Could you share the scripts needed to reproduce the results in the paper? Thanks.

I ran generation and evaluation for safety with the following scripts on an NVIDIA GPU. The results are:

total: 0.18

single: {'fixed sentence': 0.13, 'no_punctuation': 0.15, 'programming': 0.14, 'cou': 0.24, 'Refusal sentence prohibition': 0.12, 'cot': 0.28, 'scenario': 0.2, 'multitask': 0.14, 'no_long_word': 0.15, 'url_encode': 0.21, 'without_the': 0.23, 'json_format': 0.17, 'leetspeak': 0.21, 'bad words': 0.15}

These scores are not close to the results reported in Table 17 of the paper for this model.


from trustllm.generation.generation import LLMGeneration

# Generate model responses for the safety test set.
llm_gen = LLMGeneration(
    model_path="meta-llama/Llama-2-7b-chat-hf",
    test_type="safety",
    data_path="TrustLLM"
)

llm_gen.generation_results()
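
The evaluation step below needs the path of the JSON file this generation run produces. A minimal sketch for locating it, using only the standard library (the "generation_results" directory name is an assumption about the default output location, not something confirmed here):

import os

# Minimal sketch: list candidate JSON result files after generation finishes.
# Adjust "generation_results" to wherever this run actually wrote its outputs.
for root, _, files in os.walk("generation_results"):
    for name in files:
        if name.endswith(".json"):
            print(os.path.join(root, name))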


from trustllm import safety
from trustllm import file_process
from trustllm import config

evaluator = safety.SafetyEval()

# Placeholder path: replace with the jailbreak results JSON produced by the generation step.
jailbreak_data = file_process.load_json('jailbreak_data_json_path')
print(evaluator.jailbreak_eval(jailbreak_data, eval_type='total'))   # returns the overall RtA
print(evaluator.jailbreak_eval(jailbreak_data, eval_type='single'))  # returns an RtA dict for each kind of jailbreak attack
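
To make the comparison against Table 17 easier, it may help to keep the overall and per-attack RtA scores together in one file. A minimal sketch using the standard library (the output filename is hypothetical):

import json

# Minimal sketch: persist both evaluation outputs side by side so they can be
# compared against the paper's Table 17. The filename is a hypothetical choice.
results = {
    "total": evaluator.jailbreak_eval(jailbreak_data, eval_type='total'),
    "single": evaluator.jailbreak_eval(jailbreak_data, eval_type='single'),
}
with open("llama2_7b_chat_jailbreak_rta.json", "w") as f:
    json.dump(results, f, indent=2)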
