# QueryExtraction/extraction.py

from .spacy_test_query import pytextrank_extract
from .gensim_test_query import gensim_extract
from .rake_test_query import rake_extract
from .rakun_test_query import rakun_extract
from .yake_test_query import yake_extract
from .keybert_test_query import keybert_extract

# Sample input text and the keyword list expected from it, kept here
# commented out for reference:

# t1='''
# In the field of computer vision, researchers have repeatedly shown the value of transfer learning — pre-training a neural network model on a known task, for instance ImageNet, and then performing fine-tuning — using the trained neural network as the basis of a new purpose-specific model. In recent years, researchers have been showing that a similar technique can be useful in many natural language tasks.
# A different approach, which is also popular in NLP tasks and exemplified in the recent ELMo paper, is feature-based training. In this approach, a pre-trained neural network produces word embeddings which are then used as features in NLP models.
# How BERT works
# BERT makes use of Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of Transformer are described in a paper by Google.
# As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
# The chart below is a high-level description of the Transformer encoder. The input is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to an input token with the same index.
# When training language models, there is a challenge of defining a prediction goal. Many models predict the next word in a sequence (e.g. “The child came home from ___”), a directional approach which inherently limits context learning. To overcome this challenge, BERT uses two training strategies:
# Masked LM (MLM)
# Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. In technical terms, the prediction of the output words requires:
# Adding a classification layer on top of the encoder output.
# Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
# Calculating the probability of each word in the vocabulary with softmax.
#
# '''
#
# my_list1 = ['computer vision', 'bert', 'transformer', 'neural network', 'imagenet', 'Masked LM',
#             'embedding matrix', 'softmax']


def evaluate(extracted, reference):
    # Count how many of the extracted keywords appear exactly in the
    # reference keyword list.
    count = 0
    for kw in extracted:
        if kw in reference:
            count += 1
    return count
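
# Illustrative example (not in the original module): evaluate() counts exact,
# case-sensitive matches between the two lists, e.g.
# evaluate(['bert', 'softmax', 'cnn'], ['bert', 'softmax']) returns 2.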


def main_extraction(words_search, text):
    # Run each keyword-extraction backend over the text and report how many of
    # the expected keywords (words_search) each one captured.
    t1 = text
    print('Extraction:')

    print('SpaCy TextRank:')
    extracted = pytextrank_extract(t1, 30)
    print('Captured ', evaluate(extracted, words_search))
    print(extracted)
    print('-' * 30)

    print('Gensim TextRank:')
    extracted = gensim_extract(t1, 0.3)
    print('Captured ', evaluate(extracted, words_search))
    print(extracted)
    print('-' * 30)

    print('Rake:')
    extracted = rake_extract(t1, 30)
    print('Captured ', evaluate(extracted, words_search))
    print(extracted)
    print('-' * 30)

    print('Rakun:')
    extracted = rakun_extract(t1)
    print('Captured ', evaluate(extracted, words_search))
    print(extracted)
    print('-' * 30)

    print('Yake:')
    extracted = yake_extract(t1)
    print('Captured ', evaluate(extracted, words_search))
    print(extracted)
    print('-' * 30)

    print('KeyBERT:')
    extracted = keybert_extract(t1)
    print('Captured ', evaluate(extracted, words_search))
    print(extracted)
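

# A minimal usage sketch (not part of the original module). Because of the
# relative imports above, this file is meant to be imported as part of the
# QueryExtraction package; if run at all, it would be via
# `python -m QueryExtraction.extraction`. The sample text is a short excerpt
# of the commented-out t1 above, and the keyword list matches my_list1.
if __name__ == '__main__':
    sample_text = (
        "BERT makes use of Transformer, an attention mechanism that learns "
        "contextual relations between words (or sub-words) in a text. "
        "The prediction of the output words requires adding a classification "
        "layer on top of the encoder output, multiplying the output vectors "
        "by the embedding matrix, and calculating the probability of each "
        "word in the vocabulary with softmax."
    )
    expected_keywords = ['computer vision', 'bert', 'transformer', 'neural network',
                         'imagenet', 'Masked LM', 'embedding matrix', 'softmax']
    main_extraction(expected_keywords, sample_text)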