
QueryExtraction/extraction.py
from .spacy_test_query import pytextrank_extract
from .gensim_test_query import gensim_extract
from .rake_test_query import rake_extract
from .rakun_test_query import rakun_extract
from .yake_test_query import yake_extract
from .keybert_test_query import keybert_extract

# t1 = '''
# In the field of computer vision, researchers have repeatedly shown the value of transfer learning — pre-training a neural network model on a known task, for instance ImageNet, and then performing fine-tuning — using the trained neural network as the basis of a new purpose-specific model. In recent years, researchers have been showing that a similar technique can be useful in many natural language tasks.
# A different approach, which is also popular in NLP tasks and exemplified in the recent ELMo paper, is feature-based training. In this approach, a pre-trained neural network produces word embeddings which are then used as features in NLP models.
# How BERT works
# BERT makes use of Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of Transformer are described in a paper by Google.
# As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
# The chart below is a high-level description of the Transformer encoder. The input is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to an input token with the same index.
# When training language models, there is a challenge of defining a prediction goal. Many models predict the next word in a sequence (e.g. “The child came home from ___”), a directional approach which inherently limits context learning. To overcome this challenge, BERT uses two training strategies:
# Masked LM (MLM)
# Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. In technical terms, the prediction of the output words requires:
# Adding a classification layer on top of the encoder output.
# Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
# Calculating the probability of each word in the vocabulary with softmax.
#
# '''
#
# my_list1 = ['computer vision', 'bert', 'transformer', 'neural network', 'imagenet', 'Masked LM',
#             'embedding matrix', 'softmax']

def evaluate(extracted, reference):
    """Count how many extracted keywords appear in the reference keyword list."""
    count = 0
    for kw in extracted:
        if kw in reference:
            count += 1
    return count
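
# Illustrative check of evaluate() with hypothetical values (not part of the pipeline):
# evaluate(['bert', 'softmax', 'attention'], ['bert', 'transformer', 'softmax']) -> 2,
# since matching is plain, case-sensitive list membership.
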
def main_extraction(words_search, text):
    t1 = text
    print('Extraction:')

    print('SpaCy TextRank:')
    extracted = pytextrank_extract(t1, 30)
    print('Captured ', evaluate(extracted, words_search))
    print(extracted)
    print('-' * 30)

    print('Gensim TextRank:')
    extracted = gensim_extract(t1, 0.3)
    print('Captured ', evaluate(extracted, words_search))
    print(extracted)
    print('-' * 30)

    print('Rake:')
    extracted = rake_extract(t1, 30)
    print('Captured ', evaluate(extracted, words_search))
    print(extracted)
    print('-' * 30)

    print('Rakun:')
    extracted = rakun_extract(t1)
    print('Captured ', evaluate(extracted, words_search))
    print(extracted)
    print('-' * 30)

    print('Yake:')
    extracted = yake_extract(t1)
    print('Captured ', evaluate(extracted, words_search))
    print(extracted)
    print('-' * 30)

    print('KeyBERT:')
    extracted = keybert_extract(t1)
    print('Captured ', evaluate(extracted, words_search))
    print(extracted)
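

# Minimal driver sketch (an assumption, not part of the original module): run as
# `python -m QueryExtraction.extraction` so the relative imports resolve. The sample
# text and reference keywords are abridged from the commented-out t1 / my_list1 above.
if __name__ == '__main__':
    sample_text = (
        'BERT makes use of Transformer, an attention mechanism that learns '
        'contextual relations between words (or sub-words) in a text. '
        'The model then attempts to predict the original value of the masked words, '
        'based on the context provided by the other, non-masked, words in the sequence.'
    )
    sample_keywords = ['bert', 'transformer', 'neural network', 'masked lm', 'softmax']
    main_extraction(sample_keywords, sample_text)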