Previous article: https://write-bug.com/article/2689.html
What is a word embedding? Simply put, it maps a feature to a vector. In recommender systems we constantly deal with discrete features such as userid and itemid. The usual first step is to one-hot encode them, but for a feature like itemid the one-hot vector is extremely high-dimensional and contains a single 1 with every other entry 0. In that situation the common practice is to convert the feature into an embedding instead.
Why is "word embedding" rendered as 词嵌入 (word embedding model)? Think of a vector as a point in a multi-dimensional space: finding a point in that space to represent a word amounts to embedding the word into the space, hence the name.
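As a rough sketch of the difference (the vocabulary size, item id and dimensions below are made up): a one-hot id is one huge, sparse vector, while an embedding is just a row lookup in a much smaller dense matrix learned during training.
import torch
import torch.nn as nn

num_items = 100000                 # assumed itemid vocabulary size
item_id = torch.tensor([42])       # an arbitrary item

# one-hot: a 100000-dim vector with a single 1
one_hot = torch.zeros(num_items)
one_hot[item_id] = 1.0

# embedding: look up a dense 50-dim row in a learnable matrix
embedding = nn.Embedding(num_items, 50)
item_vec = embedding(item_id)      # shape: [1, 50]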
One-hot encoding
Input:
apple on a apple tree – Bag of Words vocabulary: ["apple", "on", "a", "tree"]
– The output of a word embedding method is a vector representation for each word.
The simplest, most direct representation is one-hot encoding: every word gets its own indicator vector.
For example, apple corresponds to the vector [1, 0, 0, 0] and a corresponds to [0, 0, 1, 0].
Frequency-based methods (essentially still one-hot):
Count vector: enumerate the entire vocabulary, which yields a very large, sparse matrix; while enumerating, compute each word's term frequency (TF).
Each document is represented by a vector over the vocabulary, and each word's weight is its count in that document.
Suppose the corpus is:
– D1: He is a boy.
– D2: She is a girl, good girl.
The vocabulary is {He, is, a, boy, She, girl, good}, so we can build the following 2 × 7 matrix:
        He  is  a  boy  She  girl  good
   D1    1   1  1    1    0     0     0
   D2    0   1  1    0    1     2     1
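This count matrix can be reproduced in a few lines (a sketch using scikit-learn's CountVectorizer; note the columns come out in alphabetical order rather than the order shown above):
from sklearn.feature_extraction.text import CountVectorizer

docs = ["He is a boy.", "She is a girl, good girl."]
# this token_pattern keeps single-letter tokens such as "a"
cv = CountVectorizer(lowercase=False, token_pattern=r"(?u)\b\w+\b")
X = cv.fit_transform(docs)
print(cv.get_feature_names_out())   # vocabulary (get_feature_names() on older scikit-learn)
print(X.toarray())                  # the 2 x 7 count matrix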
TF-IDF vector: considers both TF and IDF.
IDF = log(N / n), where
– N is the total number of documents in the corpus
– n is the number of documents in which the word appears.
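For instance, computing IDF by hand for the two documents above (a minimal sketch; N and n follow the definitions just given):
import math

docs = [["He", "is", "a", "boy"], ["She", "is", "a", "girl", "good", "girl"]]
N = len(docs)

def idf(word):
    n = sum(1 for d in docs if word in d)   # number of documents containing the word
    return math.log(N / n)

print(idf("girl"))   # log(2/1) ~= 0.693: rarer word -> higher weight
print(idf("is"))     # log(2/2) = 0: appears everywhere -> no discriminative power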
Co-occurrence vector: only counts co-occurrences of words inside a sliding window, which amounts to taking the context into account.
In the two previous methods every word is independent, so no semantics are learned.
Example: He is not lazy. He is intelligent. He is smart.
With a context window of size 2 we can build the corresponding co-occurrence matrix.
How the (he, is) entry is computed (window co-occurrence): count how often "is" falls within 2 words of "he"; a sketch follows.
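A minimal sketch of building the window-2 co-occurrence counts for this example (tokenisation is simplified to lowercase words without punctuation):
from collections import defaultdict

tokens = "he is not lazy he is intelligent he is smart".split()
window = 2
cooc = defaultdict(int)
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            cooc[(w, tokens[j])] += 1

print(cooc[("he", "is")])   # 4: "is" appears within 2 words of "he" four times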
Model-based methods:
CBOW
Skip-Gram
TF-IDF -> BM25
Co-occurrence -> Word2Vec
Positioning: BM25 is a relevance score used by search engines.
When we search, the sentence we type is the query and the ranked list it retrieves consists of items, i.e.:
query -> item
The query can be segmented: query -> token1 token2 token3
Likewise an item maps to a document: item -> doc -> token token token
Each token gets a score against the document, e.g. token1 -> doc: score1 (TF-IDF style).
The overall relevance of query and item is the sum: score1 + score2 + ...
As a formula:
S(q, d) = Σ_i Wi · R(qi, d)
Wi is the IDF weight of query term qi; in the standard BM25 form
Wi = log( (N − n(qi) + 0.5) / (n(qi) + 0.5) ), where
– N is the total number of documents in the index
– n(qi) is the number of documents containing qi.
The per-term relevance R is
R(qi, d) = ( fi · (k1 + 1) / (fi + K) ) · ( qfi · (k2 + 1) / (qfi + k2) ), with K = k1 · (1 − b + b · dl / avgdl)
k1, k2 and b are tuning factors; usually k1 = 2 and b = 0.75.
b controls how strongly document length affects the score: the larger b is, the more the document length influences R, and vice versa.
fi is the frequency of qi in document d (its TF); qfi is the frequency of qi in the query (the query-side TF).
In plain TF-IDF a larger TF always means a larger score, yet longer documents naturally tend to have larger TFs, so document length is ignored. BM25 addresses this: overly long documents are penalised. How is "long" defined? By dividing by avgdl, the average length over all documents.
dl is the length of document d and avgdl is the average document length.
Simplified formula: in the vast majority of cases qi occurs only once in the query, i.e. qfi = 1, so the second factor drops out and
R(qi, d) = fi · (k1 + 1) / (fi + K).
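Putting the simplified formula into code (a sketch only; the idf dict is assumed to come from a step like compute_idf below, and the full class-based implementation used in the experiments follows later):
import math

def bm25_score(query_tokens, doc_tokens, idf, avgdl, k1=2.0, b=0.75):
    """Simplified BM25 (qfi = 1): sum of idf-weighted, length-normalised term scores."""
    dl = len(doc_tokens)
    K = k1 * (1 - b + b * dl / avgdl)          # document-length normalisation
    score = 0.0
    for q in query_tokens:
        fi = doc_tokens.count(q)               # TF of the query term in the document
        score += idf.get(q, 0.0) * fi * (k1 + 1) / (fi + K)
    return score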
BM25 in practice:
1. gensim word2vec
Corpus -> a 50-dimensional vector per word, i.e. the word embeddings
from gensim.models import word2vec
# sentences: the tokenized corpus (a list of token lists); file_voc: output path for the vectors.
# Note: gensim >= 4.0 renames the size argument to vector_size.
model = word2vec.Word2Vec(sentences, size=50, window=5, min_count=1, workers=6)
model.wv.save_word2vec_format(file_voc, binary=False)
2. compute_idf
Compute the IDF of every word
import codecs
import json
import math
import multiprocessing
import time

file_corpus = '../data/file_corpus.txt'
file_voc = '../data/voc.txt'
file_idf = '../data/idf.txt'

class ComIdf(object):
    def __init__(self, file_corpus, file_voc, file_idf):
        self.file_corpus = file_corpus
        self.file_voc = file_voc
        self.file_idf = file_idf
        self.voc = load_voc(self.file_voc)        # helper that loads the vocabulary saved in step 1
        self.corpus_data = self.load_corpus()
        self.N = len(self.corpus_data)            # total number of documents

    def load_corpus(self):
        input_data = codecs.open(self.file_corpus, 'r', encoding='utf-8')
        return input_data.readlines()

    def com_idf(self, word):
        # n = number of documents that contain the word (not its total count),
        # matching the IDF definition given above
        n = 0
        for _, line in enumerate(self.corpus_data):
            if word in line:
                n += 1
        idf = math.log(1.0 * self.N / n + 1)      # smoothed IDF: log(N/n + 1)
        return {word: idf}

    def parts(self):
        words = set(self.voc.keys())
        multiprocessing.freeze_support()
        cores = multiprocessing.cpu_count()
        pool = multiprocessing.Pool(processes=cores - 2)
        result = pool.map(self.com_idf, words)
        pool.close()
        pool.join()
        idf_dict = dict()
        for r in result:
            k = list(r.keys())[0]
            v = list(r.values())[0]
            idf_dict[k] = idf_dict.get(k, 0) + v
        with codecs.open(self.file_idf, 'w', encoding='utf-8') as f:
            f.write(json.dumps(idf_dict, ensure_ascii=False, indent=2, sort_keys=False))

if __name__ == '__main__':
    t1 = time.time()
    IDF = ComIdf(file_corpus, file_voc, file_idf)
    IDF.parts()
3. get_sentence
Split the corpus into sentences and keep the 10,000 most frequent ones
import codecs
import re

# Assumed pattern: keep runs of Chinese characters, letters and digits as sentence blocks
re_han = re.compile(u"([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)

def get_sentence():
    file_corpus = codecs.open('../data/file_corpus.txt', 'r', encoding='utf-8')
    file_sentence = codecs.open('../data/file_sentence.txt', 'w', encoding='utf-8')
    st = dict()
    for _, line in enumerate(file_corpus):
        line = line.strip()
        blocks = re_han.split(line)
        for blk in blocks:
            if re_han.match(blk) and len(blk) > 10:
                st[blk] = st.get(blk, 0) + 1
    st = sorted(st.items(), key=lambda x: x[1], reverse=True)
    for s in st[:10000]:
        file_sentence.write(s[0] + '\n')
    file_corpus.close()
    file_sentence.close()

get_sentence()
4. test
Score sentence pairs with each method: cosine / idf-weighted cosine / bm25 / jaccard
import time
import codecs
import json
import numpy as np
import jieba
from scipy import spatial

file_voc = './data/voc.txt'
file_idf = './data/idf.txt'
file_userdict = './data/medfw.txt'

class SSIM(object):
    def __init__(self):
        t1 = time.time()
        self.voc = load_voc(file_voc)      # helper: word -> 50-dim word2vec vector (step 1)
        print("Loading word2vec vector cost %.3f seconds...\n" % (time.time() - t1))
        t1 = time.time()
        self.idf = load_idf(file_idf)      # helper: word -> idf value (step 2)
        print("Loading idf data cost %.3f seconds...\n" % (time.time() - t1))
        jieba.load_userdict(file_userdict)

    def M_cosine(self, s1, s2):
        # sentence vector = sum of word vectors, then cosine similarity
        s1_list = jieba.lcut(s1)
        s2_list = jieba.lcut(s2)
        v1 = np.array([self.voc[s] for s in s1_list if s in self.voc])
        v2 = np.array([self.voc[s] for s in s2_list if s in self.voc])
        v1 = v1.sum(axis=0)
        v2 = v2.sum(axis=0)
        sim = 1 - spatial.distance.cosine(v1, v2)
        return sim

    def M_idf(self, s1, s2):
        # same as M_cosine, but each word vector is weighted by its idf
        v1, v2 = [], []
        s1_list = jieba.lcut(s1)
        s2_list = jieba.lcut(s2)
        for s in s1_list:
            idf_v = self.idf.get(s, 1)
            if s in self.voc:
                v1.append(1.0 * idf_v * self.voc[s])
        for s in s2_list:
            idf_v = self.idf.get(s, 1)
            if s in self.voc:
                v2.append(1.0 * idf_v * self.voc[s])
        v1 = np.array(v1).sum(axis=0)
        v2 = np.array(v2).sum(axis=0)
        sim = 1 - spatial.distance.cosine(v1, v2)
        return sim

    def M_bm25(self, s1, s2, s_avg=10, k1=2.0, b=0.75):
        # simplified BM25 (qfi = 1); s_avg plays the role of avgdl
        bm25 = 0
        s1_list = jieba.lcut(s1)
        for w in s1_list:
            idf_s = self.idf.get(w, 1)
            bm25_ra = s2.count(w) * (k1 + 1)
            bm25_rb = s2.count(w) + k1 * (1 - b + b * len(s2) / s_avg)
            bm25 += idf_s * (bm25_ra / bm25_rb)
        return bm25

    def M_jaccard(self, s1, s2):
        # character-level Jaccard: |intersection| / |union|
        s1 = set(s1)
        s2 = set(s2)
        ret1 = s1.intersection(s2)
        ret2 = s1.union(s2)
        jaccard = 1.0 * len(ret1) / len(ret2)
        return jaccard

    def ssim(self, s1, s2, model='cosine'):
        if model == 'idf':
            f_ssim = self.M_idf
        elif model == 'bm25':
            f_ssim = self.M_bm25
        elif model == 'jaccard':
            f_ssim = self.M_jaccard
        else:
            f_ssim = self.M_cosine
        sim = f_ssim(s1, s2)
        return sim

sm = SSIM()
ssim = sm.ssim
def test():
    test_data = [u'临床表现及实验室检查即可做出诊断',
                 u'面条汤等容易消化吸收的食物为佳',
                 u'每天应该摄入足够的维生素A',
                 u'视患者情况逐渐恢复日常活动',
                 u'术前1天开始预防性运用广谱抗生素']
    model_list = ['cosine', 'idf', 'bm25', 'jaccard']
    file_sentence = codecs.open('./data/file_sentence.txt', 'r', encoding='utf-8')
    train_data = file_sentence.readlines()
    for model in model_list:
        t1 = time.time()
        dataset = dict()
        result = dict()
        for s1 in test_data:
            dataset[s1] = dict()
            for s2 in train_data:
                s2 = s2.strip()
                if s1 != s2:
                    sim = ssim(s1, s2, model=model)
                    dataset[s1][s2] = dataset[s1].get(s2, 0) + sim
        for r in dataset:
            # keep the most similar sentence for each test sentence
            top = sorted(dataset[r].items(), key=lambda x: x[1], reverse=True)
            result[r] = top[0]
        with codecs.open('./data/test_result.txt', 'a') as f:
            f.write('--------------The result of %s method------------------\n ' % model)
            f.write('\tThe computing cost %.3f seconds\n' % (time.time() - t1))
            f.write(json.dumps(result, ensure_ascii=True, indent=2, sort_keys=False))
            f.write('\n\n')
    file_sentence.close()

test()
Word2Vec carries over the sliding-window idea described above and obtains a better grasp of semantics.
Techniques involved:
– Hierarchical softmax
– Negative sampling
Two training schemes:
CBOW: predict the middle word from the surrounding words
Skip-Gram: predict the surrounding words from the middle word (more widely used in industry)
A sentence contains many words. Suppose the sliding window currently centres on a middle word w3 and we look at two words on each side. We pretend we do not know w3 and try to guess it from w1, w2, w4, w5: w3 becomes the label and w1, w2, w4, w5 become the training input. As the window slides, w4 in turn becomes the centre word, and so on, producing a large number of training samples (a small sketch of this sample construction follows).
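A minimal sketch of how these (context words, centre word) training pairs are produced as the window slides (function name is illustrative):
def make_cbow_samples(tokens, window=2):
    samples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        samples.append((left + right, center))   # (context words, label)
    return samples

print(make_cbow_samples("w1 w2 w3 w4 w5".split()))
# [(['w2', 'w3'], 'w1'), (['w1', 'w3', 'w4'], 'w2'), (['w1', 'w2', 'w4', 'w5'], 'w3'), ...]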
Input layer: the one-hot encoded context words {x1, ..., xC}, where C is the window size and V is the vocabulary size
Hidden layer: an N-dimensional vector
Output layer: the one-hot encoded output word y
The one-hot input vectors are connected to the hidden layer through a V×N weight matrix W; the hidden layer is connected to the output layer through an N×V weight matrix W′.
However many context words come in, their projections are summed (averaged) into a single hidden vector, and a softmax over the vocabulary produces the output (a minimal forward pass is sketched below).
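A minimal PyTorch sketch of this forward pass (plain full softmax, i.e. before the speed-ups discussed next; class and argument names are illustrative):
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_size)   # the V x N matrix W
        self.out = nn.Linear(embed_size, vocab_size)           # the N x V matrix W'

    def forward(self, context_ids):                            # context_ids: [batch, 2*C]
        h = self.in_embed(context_ids).mean(dim=1)             # average context vectors -> hidden
        return F.log_softmax(self.out(h), dim=-1)              # distribution over the V words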
How are the training samples obtained?
Both the input word and the output word are one-hot encoded vectors; the model's final output is a probability distribution over the vocabulary.
A full softmax needs every word in the vocabulary: here that is 50,002 words, and every update has to sweep over all of them, which is expensive whether done once or many times. This motivates hierarchical softmax:
Hierarchical softmax and the Huffman tree
A Huffman tree is a special kind of binary tree, also called an optimal binary tree; its main uses are data compression and minimising code length.
Definition: given n weights as n leaf nodes, construct a binary tree; if its weighted path length (WPL) is minimal, the tree is called an optimal binary tree, or Huffman tree.
WPL: the weighted path length of the tree, i.e. the sum of the weighted path lengths of all leaves.
Minimising the WPL compresses the information.
It is mainly used for coding: starting from the root, going left is 0 and going right is 1 (compare fixed-length coding with Huffman coding).
Construction (see the sketch below):
Sort all nodes by weight from smallest to largest.
Merge the two smallest nodes into a parent whose weight is their sum, put the parent back among the remaining nodes, and repeat until a single tree, the Huffman tree, remains.
Previously the softmax had to run over all 50,002 words. Instead, arrange the vocabulary as such a tree: every leaf is a word and every internal node is a sigmoid unit. For a given word we walk from the root, making a binary (logistic-regression style) left/right decision at each node until we reach the leaf for that word.
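A short sketch of that construction using a heap (word frequencies are made up; each merge of the two smallest weights prepends one bit to the codes inside the merged subtrees):
import heapq

def huffman_codes(freqs):
    # each heap entry: (weight, tie_breaker, {word: code_so_far})
    heap = [(w, i, {word: ""}) for i, (word, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)                         # two smallest weights
        w2, _, c2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in c1.items()}            # left subtree gets 0
        merged.update({w: "1" + c for w, c in c2.items()})      # right subtree gets 1
        heapq.heappush(heap, (w1 + w2, count, merged))
        count += 1
    return heap[0][2]

print(huffman_codes({"the": 50, "cat": 20, "sat": 15, "mat": 10, "hat": 5}))
# frequent words get short codes, i.e. short paths from the root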
Negative sampling
Idea: each training sample only updates a small fraction of the weights.
The model should also be pushed away from wrong (negative) samples.
Instead of updating the weights of all output neurons, we randomly pick a few "negative" words and update only their weights; of course, the weights for the "positive" word are updated as well.
Any randomly drawn word that does not co-occur with the centre word counts as a negative sample; higher-probability words are drawn more often, i.e. the more frequent a word is, the more likely it is to be sampled (see the short sketch below; the same trick appears again in the full training code).
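A short sketch of the sampling distribution word2vec uses: raise the unigram counts to the 3/4 power and renormalise (the same trick appears with torch.multinomial in the training code below; the counts here are made up):
import numpy as np

counts = np.array([90.0, 9.0, 1.0])      # made-up word counts
probs = counts ** 0.75
probs = probs / probs.sum()              # negative-sampling distribution
print(probs)                             # frequent words still dominate, but less extremely
negatives = np.random.choice(len(counts), size=5, p=probs)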
In the implementations from the original papers:
CBOW performs one softmax per window (the context words collapse into one prediction);
Skip-Gram performs several softmaxes per window (one per context word).
Skip-Gram therefore takes longer to train but is more accurate;
CBOW trains faster, while Skip-Gram handles rare words better.
Word2Vec in practice:
import sys
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as tud
from torch.nn.parameter import Parameter
from collections import Counter
import numpy as np
import random
import math
import pandas as pd
import scipy.spatial
import scipy.stats
import sklearn
from sklearn.metrics.pairwise import cosine_similarity

# hyperparameters
K = 100                  # number of negative samples
C = 3                    # nearby words threshold (context window size)
NUM_EPOCHS = 2           # number of training epochs
MAX_VOCAB_SIZE = 30000   # vocabulary size
BATCH_SIZE = 128         # batch size
LEARNING_RATE = 0.2      # initial learning rate
EMBEDDING_SIZE = 100
USE_CUDA = torch.cuda.is_available()   # train on GPU if available
LOG_FILE = "word2vec_log.txt"          # assumed log file name (not given in the original)
# tokenization
def word_tokenize(text):
    return text.split()

with open("./data/text8", "r") as fin:
    text = fin.read()

text = [w for w in word_tokenize(text.lower())]
print(text[1])
vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE - 1))   # keep the most frequent 29,999 words
vocab["<unk>"] = len(text) - np.sum(list(vocab.values()))     # everything else maps to <unk>
print(np.sum(list(vocab.values())))
print(len(text))
print(vocab["<unk>"])
print(len(vocab))
# output:
# originated
# 17005207
# 17005207
# 690103
# 30000
idx_to_word = [word for word in vocab.keys()]
word_to_idx = {word: i for i, word in enumerate(idx_to_word)}
word_counts = np.array([count for count in vocab.values()], dtype=np.float32)
word_freqs = word_counts / np.sum(word_counts)
word_freqs = word_freqs ** (3. / 4.)
word_freqs = word_freqs / np.sum(word_freqs)   # used for negative sampling
VOCAB_SIZE = len(idx_to_word)
print(VOCAB_SIZE)
# text          list of all tokens
# word_to_idx   word -> id mapping (dict)
# idx_to_word   id -> word list
# word_freqs    negative-sampling probabilities
# word_counts   word counts
# building the training samples:
class WordEmbeddingDataset(tud.Dataset):
    def __init__(self, text, word_to_idx, idx_to_word, word_freqs, word_counts):
        ''' text: a list of words, all text from the training dataset
            word_to_idx: the dictionary from word to idx
            idx_to_word: idx to word mapping
            word_freqs: the frequency of each word
            word_counts: the word counts
        '''
        super(WordEmbeddingDataset, self).__init__()
        self.text_encoded = [word_to_idx.get(t, VOCAB_SIZE - 1) for t in text]
        self.text_encoded = torch.Tensor(self.text_encoded).long()
        self.word_to_idx = word_to_idx
        self.idx_to_word = idx_to_word
        self.word_freqs = torch.Tensor(word_freqs)
        self.word_counts = torch.Tensor(word_counts)

    def __len__(self):
        ''' Returns the length of the whole dataset (number of tokens) '''
        return len(self.text_encoded)

    def __getitem__(self, idx):
        ''' Returns, for one training example:
            - the center word
            - the (positive) words near it
            - K * 2C randomly sampled words as negative samples
        '''
        center_word = self.text_encoded[idx]
        pos_indices = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))  # window: C words on each side
        pos_indices = [i % len(self.text_encoded) for i in pos_indices]  # wrap around when the window runs off either end
        pos_words = self.text_encoded[pos_indices]
        neg_words = torch.multinomial(self.word_freqs, K * pos_words.shape[0], True)
        return center_word, pos_words, neg_words

dataset = WordEmbeddingDataset(text, word_to_idx, idx_to_word, word_freqs, word_counts)
dataloader = tud.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_size):
        ''' Initialize the input and output embeddings '''
        super(EmbeddingModel, self).__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        initrange = 0.5 / self.embed_size
        self.out_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False)
        self.out_embed.weight.data.uniform_(-initrange, initrange)
        self.in_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False)
        self.in_embed.weight.data.uniform_(-initrange, initrange)

    def forward(self, input_labels, pos_labels, neg_labels):
        '''
        input_labels: center words, [batch_size]
        pos_labels: words that appear in the context window of the center word, [batch_size, (window_size * 2)]
        neg_labels: words that do not appear around the center word, drawn by negative sampling, [batch_size, (window_size * 2 * K)]
        return: loss
        '''
        input_embedding = self.in_embed(input_labels)   # B * embed_size
        pos_embedding = self.out_embed(pos_labels)      # B * (2*C) * embed_size
        neg_embedding = self.out_embed(neg_labels)      # B * (2*C*K) * embed_size
        # debug prints for tensor shapes
        print("input_embedding size:", input_embedding.size())
        print("pos_embedding size:", pos_embedding.size())
        print("neg_embedding size:", neg_embedding.size())
        log_pos = torch.bmm(pos_embedding, input_embedding.unsqueeze(2)).squeeze()   # B * (2*C)
        log_neg = torch.bmm(neg_embedding, -input_embedding.unsqueeze(2)).squeeze()  # B * (2*C*K); unsqueeze then squeeze
        print("log_pos size:", log_pos.size())
        print("log_neg size:", log_neg.size())
        log_pos = F.logsigmoid(log_pos).sum(1)   # sum over dim 1, the context dimension
        log_neg = F.logsigmoid(log_neg).sum(1)
        loss = log_pos + log_neg
        print("log_pos size:", log_pos.size())
        print("log_neg size:", log_neg.size())
        print("loss size:", loss.size())
        return -loss

    def input_embeddings(self):
        return self.in_embed.weight.data.cpu().numpy()

model = EmbeddingModel(VOCAB_SIZE, EMBEDDING_SIZE)
if USE_CUDA:
    model = model.cuda()
if __name__ == '__main__':
    optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)
    for e in range(NUM_EPOCHS):
        for i, (input_labels, pos_labels, neg_labels) in enumerate(dataloader):
            print(input_labels.size())
            print(pos_labels.size())
            print(neg_labels.size())
            input_labels = input_labels.long()
            pos_labels = pos_labels.long()
            neg_labels = neg_labels.long()
            if USE_CUDA:
                input_labels = input_labels.cuda()
                pos_labels = pos_labels.cuda()
                neg_labels = neg_labels.cuda()
            optimizer.zero_grad()
            loss = model(input_labels, pos_labels, neg_labels).mean()
            loss.backward()
            optimizer.step()
            if i % 100 == 0:
                with open(LOG_FILE, "a") as fout:
                    fout.write("epoch: {}, iter: {}, loss: {}\n".format(e, i, loss.item()))
                print("epoch: {}, iter: {}, loss: {}".format(e, i, loss.item()))
    embedding_weights = model.input_embeddings()
    torch.save(model.state_dict(), "embedding-{}.th".format(EMBEDDING_SIZE))
Use cases:
1. Item title retrieval + ANN + word averaging. Word average: sum all the input word vectors and take the mean. For example, given three 4-dimensional word vectors (1,2,3,4), (9,6,11,8), (5,10,7,12), the averaged sentence vector is (5,6,7,8) (see the short sketch after this list).
2. Model initialisation: pre-trained embeddings can initialise embedding features in other models.
3. Synonyms / semantic understanding: adding and subtracting word vectors and computing similarities captures semantic relations.
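A tiny sketch of the word-average sentence vector from item 1, using the made-up vectors above:
import numpy as np

word_vectors = np.array([[1, 2, 3, 4],
                         [9, 6, 11, 8],
                         [5, 10, 7, 12]], dtype=np.float32)
sentence_vector = word_vectors.mean(axis=0)
print(sentence_vector)   # [5. 6. 7. 8.]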
Worked example:
def evaluate(filename, embedding_weights):
    # compare model similarity against human-annotated similarity (a wordsim-style csv/tsv file)
    if filename.endswith(".csv"):
        data = pd.read_csv(filename, sep=",")
    else:
        data = pd.read_csv(filename, sep="\t")
    human_similarity = []
    model_similarity = []
    for i in data.iloc[:, 0:2].index:
        word1, word2 = data.iloc[i, 0], data.iloc[i, 1]
        if word1 not in word_to_idx or word2 not in word_to_idx:
            continue
        else:
            word1_idx, word2_idx = word_to_idx[word1], word_to_idx[word2]
            word1_embed, word2_embed = embedding_weights[[word1_idx]], embedding_weights[[word2_idx]]
            model_similarity.append(float(sklearn.metrics.pairwise.cosine_similarity(word1_embed, word2_embed)))
            human_similarity.append(float(data.iloc[i, 2]))
    return scipy.stats.spearmanr(human_similarity, model_similarity)

def find_nearest(word):
    index = word_to_idx[word]
    embedding = embedding_weights[index]
    # cosine distance = 1 - cosine similarity, so smaller is closer
    cos_dis = np.array([scipy.spatial.distance.cosine(e, embedding) for e in embedding_weights])
    return [idx_to_word[i] for i in cos_dis.argsort()[:10]]

model.load_state_dict(torch.load("model/embedding-{}.th".format(EMBEDDING_SIZE), map_location='cpu'))
embedding_weights = model.input_embeddings()

# nearest neighbours of a few words
for word in ["good", "computer", "china", "mobile", "study"]:
    print(word, find_nearest(word))

# relations between words: woman - man + king, then nearest neighbours
man_idx = word_to_idx["man"]
king_idx = word_to_idx["king"]
woman_idx = word_to_idx["woman"]
embedding = embedding_weights[woman_idx] - embedding_weights[man_idx] + embedding_weights[king_idx]
cos_dis = np.array([scipy.spatial.distance.cosine(e, embedding) for e in embedding_weights])
for i in cos_dis.argsort()[:20]:
    print(idx_to_word[i])