Installation
To install a very lightweight version of the Transformers library, follow these steps:
- 1. Open a terminal or command prompt.
- 2. Run the following command to install the Transformers library:
pip install transformers
This uses the pip tool to download and install the Transformers library from the Python Package Index (PyPI). Make sure pip is already installed on your machine.
You can then import the Transformers library in your Python code:
import transformers
You can now use the functionality the Transformers library provides.
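To quickly check that the installation worked, you can print the library version (a minimal sketch; the exact version number will differ on your machine):
import transformers
print(transformers.__version__)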
If you want to install the development version, which includes the dependencies needed for nearly every use case, follow these steps:
- 1. Open a terminal or command prompt.
- 2. Run the following command to install the Transformers library together with its extras:
pip install transformers[sentencepiece]
This installs the Transformers library along with the sentencepiece dependency, a subword tokenization library used by many model tokenizers. Note that this version is larger than the lightweight one because it includes more dependencies.
I. Transformer Models
The pipeline function in the Transformers library is a very convenient tool that lets you apply a pretrained model to text directly. Here are a few simple pipeline examples:
1. Sentiment analysis:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("i've been waiting for a huggingface course my whole life.")
print(result)
Output:
[{'label': 'POSITIVE', 'score': 0.9598047137260437}]
You can also pass several sentences for sentiment analysis:
results = classifier([
"i've been waiting for a huggingface course my whole life.",
"i hate this so much!"
])
print(results)
Output:
[{'label': 'POSITIVE', 'score': 0.9598047137260437},
{'label': 'NEGATIVE', 'score': 0.9994558095932007}]
2. Other available pipelines:
Besides sentiment analysis, the Transformers library provides several other pipelines, such as feature-extraction, ner (named entity recognition), question-answering, summarization, text-generation, translation, and more. Choose the pipeline that fits your task; for instance, the text-generation sketch below.
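A minimal text-generation sketch (the generated text varies from run to run, and a default model is downloaded on first use):
from transformers import pipeline
generator = pipeline("text-generation")
result = generator("in this course, we will teach you how to")
print(result)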
Another example uses zero-shot-classification for classification:
classifier = pipeline("zero-shot-classification")
result = classifier(
"i've been waiting for a huggingface course my whole life.",
candidate_labels=["positive", "negative"]
)
print(result)
Output:
{'sequence': "i've been waiting for a huggingface course my whole life.",
'labels': ['positive', 'negative'],
'scores': [0.9943647985458374, 0.00563523546639061]}
Concrete examples of these pipelines can be found in: Transformer models - Hugging Face Course
3. Representative models for various tasks
Model | Examples | Tasks |
---|---|---|
Encoder models | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa | Sentence classification, named entity recognition, extractive question answering. Best suited to tasks that require understanding the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering. |
Decoder models | CTRL, GPT, GPT-2, Transformer XL | Text generation. Pretraining of decoder models usually revolves around predicting the next word in the sentence, making them best suited to tasks involving text generation. |
Encoder-decoder (sequence-to-sequence) models | BART, T5, Marian, mBART | Summarization, translation, generative question answering. Sequence-to-sequence models are best suited to tasks that revolve around generating new sentences from a given input, such as summarization, translation, or generative question answering. |
Quiz for this section: Transformer models - Hugging Face Course
II. Using Transformers
1. What happens inside the pipeline
When you use a pipeline, three main steps run behind the scenes: the tokenizer, the model, and post-processing. Chained together, these steps let the pipeline accept raw text as input, process it with a pretrained model, and produce the corresponding output. This design makes text processing with the Transformers library simple and efficient.
1) Tokenizer
The tokenizer splits the raw input text into a sequence of words or subwords so the model can understand and process it, converting the text into the input format the model accepts. Like other neural networks, Transformer models cannot process raw text directly, which is why this preprocessing step is needed. In the Transformers library, you can instantiate a tokenizer suited to a given pretrained model with the AutoTokenizer class and its from_pretrained() method:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
To specify the type of tensors we want returned (PyTorch, TensorFlow, or plain NumPy), we use the return_tensors argument:
raw_inputs = [
"i've been waiting for a huggingface course my whole life.",
"i hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
The result as PyTorch tensors:
The output itself is a dictionary with two keys, input_ids and attention_mask.
{
'input_ids': tensor([
[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
[ 101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0]
]),
'attention_mask': tensor([
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
])
}
2) Model
The model is a pretrained Transformer model. It receives the input prepared by the tokenizer and processes it to produce the corresponding output. Depending on the task, you can choose different pretrained models, such as BERT or GPT. Inside a pipeline, the model is automatically downloaded and loaded from the Hugging Face model hub.
Transformers provides an AutoModel class, which also has a from_pretrained() method:
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
If we feed our preprocessed inputs to this model, we can see:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
# Output
# torch.Size([2, 16, 768])
Many different architectures are available in Transformers, each designed around a specific task. A partial list (a sequence-classification example is sketched after the list):
*Model (retrieve the hidden states)
*ForCausalLM
*ForMaskedLM
*ForMultipleChoice
*ForQuestionAnswering
*ForSequenceClassification
*ForTokenClassification
and others
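For example, loading the same checkpoint with a sequence-classification head returns task logits instead of hidden states. Reusing the inputs produced by the tokenizer step above, the output shape becomes one pair of logits per sentence:
from transformers import AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)  # `inputs` from the tokenizer step above
print(outputs.logits.shape)
# torch.Size([2, 2])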
3) Post-processing
The post-processing step interprets the model's output according to the task. For sentiment analysis, for example, post-processing converts the probability scores from the model output into labels (such as "POSITIVE" or "NEGATIVE"); for question answering, it extracts the answer from the model output. Post-processing can be customized for the specific task.
The values a model outputs from its last layer are raw, unnormalized scores (logits). To convert them to probabilities, they need to go through a softmax layer. (All Transformers models output logits, because the loss function used for training generally fuses the final activation function, such as softmax, with the actual loss function, such as cross-entropy.) Note that logits come from a model with a task-specific head, like the sequence-classification model sketched above; the bare AutoModel only returns hidden states.
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
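To map these probabilities back to label names, read the id-to-label mapping from the model config:
print(model.config.id2label)
# {0: 'NEGATIVE', 1: 'POSITIVE'}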
2. Models
1) Creating a Transformer
from transformers import BertConfig, BertModel
# Building the config
config = BertConfig()
# Building the model from the config (randomly initialized at this point)
model = BertModel(config)
2) Different ways to load a model
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-cased")
3) Saving a model
model.save_pretrained("directory_on_my_computer")
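save_pretrained() writes the model's configuration and weights into that folder, and the model can later be reloaded from the same local path (a minimal sketch, reusing the placeholder directory above):
model = BertModel.from_pretrained("directory_on_my_computer")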
4) Using a Transformer model
sequences = ["hello!", "cool.", "nice!"]
encoded_sequences = [
[101, 7592, 999, 102],
[101, 4658, 1012, 102],
[101, 3835, 999, 102],
]
import torch
model_inputs = torch.tensor(encoded_sequences)
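The resulting tensor can then be passed straight to the model, continuing the sketch above:
output = model(model_inputs)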
3. Tokenizers
1) Loading and saving
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer("using a transformer network is simple")
# Output
'''
{'input_ids': [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
'''
# Save
tokenizer.save_pretrained("directory_on_my_computer")
2) Tokenization
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "using a transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens) # Output: ['using', 'a', 'transform', '##er', 'network', 'is', 'simple']
# From tokens to input IDs
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids) # Output: [7993, 170, 11303, 1200, 2443, 1110, 3014]
3) Decoding
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string) # Output: 'using a transformer network is simple'
4. Handling multiple sequences
1) Models expect a batch of inputs
Convert the list of numbers to a tensor and send it to the model:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence = "i've been waiting for a huggingface course my whole life."
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor([ids])
print("input ids:", input_ids)
output = model(input_ids)
print("logits:", output.logits)
# Output
'''
input ids: [[ 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]]
logits: [[-2.7276, 2.8789]]
'''
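Batching simply means sending several sequences through the model at once; to batch a single sequence, repeat it (a small sketch in the spirit of the course):
batched_ids = [ids, ids]
output = model(torch.tensor(batched_ids))
print(output.logits)  # two identical rows of logits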
2) Padding the inputs
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
[200, 200, 200],
[200, 200, tokenizer.pad_token_id],
]
print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)
# Output
'''
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
[ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)
'''
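Note that the second row of the batched logits differs from the logits of sequence2 on its own: the model also attended to the padding token. To make the results match, pass an attention mask that tells the model to ignore the padding (as the course does; expected output shown in the comment):
attention_mask = [
[1, 1, 1],
[1, 1, 0],  # 0 masks the padding token
]
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
# tensor([[ 1.5694, -1.3895],
#         [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)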
5. Putting it all together
We have explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks. The Transformers API can handle all of this for us through a high-level function:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# It can tokenize a single sequence
sequence = "i've been waiting for a huggingface course my whole life."
model_inputs = tokenizer(sequence)
# It can also process several sequences at a time
sequences = ["i've been waiting for a huggingface course my whole life.", "so have i!"]
model_inputs = tokenizer(sequences)
# It can pad according to several strategies:
# will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")
# will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
# will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
# It can also truncate sequences:
sequences = ["i've been waiting for a huggingface course my whole life.", "so have i!"]
# will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
# will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
# It can handle conversion to specific framework tensors, which can then be sent directly to the model.
sequences = ["i've been waiting for a huggingface course my whole life.", "so have i!"]
# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
Special tokens
The tokenizer adds the special token [CLS] at the beginning and the special token [SEP] at the end.
sequence = "i've been waiting for a huggingface course my whole life."
model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
# Output
'''
[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
'''
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))
# Output
'''
"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
"i've been waiting for a huggingface course my whole life."
'''
# Wrapping up: from tokenizer to model
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["i've been waiting for a huggingface course my whole life.", "so have i!"]
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
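To finish by hand what the pipeline does, apply a softmax to the logits as in the post-processing step earlier:
predictions = torch.nn.functional.softmax(output.logits, dim=-1)
print(predictions)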