
I am using the Lesk algorithm to get SynSets from a text, but NLTK's lesk returns different results for the same input. Is this a "feature" of the Lesk algorithm, or am I doing something wrong? Here is the code I am using:

self.SynSets = []
sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
    Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
    The language provides constructs intended to enable clear programs on both a small and large scale.\
    Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
    ")
stopwordsList = stopwords.words('english')
self.sentNum = 0
for sentence in sentences:
    raw_tokens = word_tokenize(sentence)
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    #and (len(token) > 3)
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            self.SynSets.append(synset)

self.SynSets = set(self.SynSets)
self.WriteSynSets()
return self

In the output I get these results (the first 3 results from 2 different runs):

Synset('allow.v.09') Synset('code.n.03') Synset('coffee.n.01') 
------------ 
Synset('allow.v.09') Synset('argumentation.n.02') Synset('boastfully.r.01') 

If there is another (more stable) way to get synsets, I would be grateful for your help.

Thanks in advance.
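One way to get output that is stable across runs is to compute the Lesk definition overlap yourself and break ties deterministically, for example by sorting the candidate synsets by name. A minimal sketch, assuming NLTK 3.x (where Synset.name() and Synset.definition() are methods); stable_lesk is a hypothetical helper, not an NLTK API:

from nltk import word_tokenize
from nltk.corpus import wordnet as wn

def stable_lesk(sentence, word):
    # hypothetical helper: simplified Lesk overlap with a deterministic
    # tie-break (candidates are visited in sorted order, so equal scores
    # always resolve to the same synset)
    context = set(word_tokenize(sentence))
    best, best_score = None, -1
    for ss in sorted(wn.synsets(word), key=lambda s: s.name()):
        score = len(context.intersection(ss.definition().split()))
        if score > best_score:
            best, best_score = ss, score
    return best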


Edited

As an additional example, here is the complete script that I ran twice:

import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import wsd
from nltk.corpus import stopwords

SynSets = []
sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
    Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
    The language provides constructs intended to enable clear programs on both a small and large scale.\
    Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
    ")
stopwordsList = stopwords.words('english')

for sentence in sentences:
    raw_tokens = word_tokenize(sentence)  # WordPunctTokenizer().tokenize(sentence)
    # removing stopwords and digit-only tokens (the length filter is commented out)
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    #and (len(token) > 3)
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            SynSets.append(synset)

SynSets = sorted(set(SynSets))
with open("synsets.txt", "a") as file:
    file.write("\n-------------------\n")
    for synset in SynSets:
        file.write("{} ".format(synset))
# the with-block already closes the file, so no explicit file.close() is needed

and I got these results (the first 4 synsets written to the file on each of the 2 runs of the program):

  • Synset('allow.v.04') Synset('boastfully.r.01') Synset('clear.v.11') Synset('code.n.02')

  • Synset('boastfully.r.01') Synset('clear.v.19') Synset('code.n.01') Synset('design.n.04')

SOLUTION: I figured out what the problem was. After reinstalling Python 2.7, all the problems went away. So, don't use Python 3.x with the lesk algorithm.
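A plausible explanation for why the interpreter version matters: when several candidate senses tie on overlap score, picking the winner from an unordered collection depends on iteration order, and Python 3.3+ randomizes string hashing per process (see PYTHONHASHSEED), so that order can change between runs. A minimal sketch of the effect, not NLTK's actual code:

# Minimal sketch: break a tie by taking max() over a set. Set iteration
# order depends on string hashes, which Python 3.3+ randomizes per
# process, so the "winner" can change between separate runs.
candidates = {"code.n.01", "code.n.02", "code.n.03"}
scores = {sense: 0 for sense in candidates}   # all senses tie
best = max(candidates, key=lambda sense: scores[sense])
print(best)   # may differ across separate interpreter invocations

If this is the cause, exporting PYTHONHASHSEED=0 before starting the interpreter should make Python 3 runs reproducible as well, without downgrading to 2.7.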

Answer


There is a WSD function for the Lesk method in the latest version of NLTK:

>>> from nltk.wsd import lesk 
>>> from nltk import sent_tokenize, word_tokenize 
>>> text = "Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles." 
>>> for sent in sent_tokenize(text): 
...  for word in word_tokenize(sent): 
...    print word, lesk(sent, word), sent 

[out]:

Python Synset('python.n.02') Python is a widely used general-purpose, high-level programming language. 
is Synset('be.v.08') Python is a widely used general-purpose, high-level programming language. 
a Synset('angstrom.n.01') Python is a widely used general-purpose, high-level programming language. 
widely Synset('wide.r.04') Python is a widely used general-purpose, high-level programming language. 
used Synset('use.v.01') Python is a widely used general-purpose, high-level programming language. 
general-purpose None Python is a widely used general-purpose, high-level programming language. 
, None Python is a widely used general-purpose, high-level programming language. 
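Many of the odd picks above come from lesk() considering senses of every part of speech. In NLTK 3.x the signature is lesk(context_sentence, ambiguous_word, pos=None, synsets=None), so the candidates can be restricted, for example to nouns (output omitted here, since it depends on your WordNet version):

>>> from nltk.wsd import lesk 
>>> from nltk import word_tokenize 
>>> sent = "Python is a widely used general-purpose, high-level programming language." 
>>> # consider only noun senses of 'language' 
>>> print lesk(word_tokenize(sent), 'language', pos='n') 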

Also, try disambiguate() from pywsd (https://github.com/alvations/pywsd):

>>> from pywsd import disambiguate 
>>> from nltk import sent_tokenize 
>>> text = "Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles." 
>>> for sent in sent_tokenize(text): 
...  print disambiguate(sent, prefersNone=True) 
... 

[out]:

[('Python', Synset('python.n.02')), ('is', None), ('a', None), ('widely', Synset('widely.r.03')), ('used', Synset('used.a.01')), ('general-purpose', None), (',', None), ('high-level', None), ('programming', Synset('scheduling.n.01')), ('language', Synset('terminology.n.01')), ('.', None)] 
[('Its', None), ('design', Synset('purpose.n.01')), ('philosophy', Synset('philosophy.n.03')), ('emphasizes', Synset('stress.v.01')), ('code', Synset('code.n.03')), ('readability', Synset('readability.n.01')), (',', None), ('and', None), ('its', None), ('syntax', Synset('syntax.n.03')), ('allows', Synset('let.v.01')), ('programmers', Synset('programmer.n.01')), ('to', None), ('express', Synset('express.n.03')), ('concepts', Synset('concept.n.01')), ('in', None), ('fewer', None), ('lines', Synset('wrinkle.n.01')), ('of', None), ('code', Synset('code.n.03')), ('than', None), ('would', None), ('be', None), ('possible', Synset('potential.a.01')), ('in', None), ('languages', Synset('linguistic_process.n.02')), ('such', None), ('as', None), ('C++', None), ('or', None), ('Java', Synset('java.n.03')), ('.', None)] 
[('The', None), ('language', Synset('language.n.01')), ('provides', Synset('provide.v.06')), ('constructs', Synset('concept.n.01')), ('intended', Synset('mean.v.03')), ('to', None), ('enable', None), ('clear', Synset('open.n.01')), ('programs', Synset('program.n.08')), ('on', None), ('both', None), ('a', None), ('small', Synset('small.a.01')), ('and', None), ('large', Synset('large.a.01')), ('scale', Synset('scale.n.10')), ('.', None)] 
[('Python', Synset('python.n.02')), ('supports', Synset('support.n.11')), ('multiple', None), ('programming', Synset('program.v.02')), ('paradigms', Synset('substitution_class.n.01')), (',', None), ('including', Synset('include.v.03')), ('object-oriented', None), (',', None), ('imperative', Synset('imperative.a.02')), ('and', None), ('functional', Synset('functional.a.01')), ('programming', Synset('scheduling.n.01')), ('or', None), ('procedural', Synset('procedural.a.01')), ('styles', Synset('vogue.n.01')), ('.', None)] 

They are not perfect, but they are close to an accurate implementation of Lesk.
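pywsd also ships explicit Lesk variants; at least in versions from around this time there are simple_lesk, adapted_lesk and cosine_lesk in pywsd.lesk (check the repository if the layout has changed). For a single word:

>>> from pywsd.lesk import simple_lesk 
>>> sent = "Python is a widely used general-purpose, high-level programming language." 
>>> # pick a sense for one ambiguous word in its sentence context 
>>> print simple_lesk(sent, 'language') 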


EDITED

To make sure that the results are the same every time you run it, there should be no output (i.e. no AssertionError) when you run this:

from nltk.wsd import lesk
from nltk import sent_tokenize, word_tokenize
text = "Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles."

for sent in sent_tokenize(text):
    lst = []
    for word in word_tokenize(sent):
        lst.append(lesk(sent, word))
    for i in range(10):
        lst2 = []
        for word in word_tokenize(sent):
            lst2.append(lesk(sent, word))
        assert lst2 == lst

I also ran the code from the OP 10 times, and it gives the same result every time:

import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import wsd
from nltk.corpus import stopwords

def run():
    SynSets = []
    sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
        Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
        The language provides constructs intended to enable clear programs on both a small and large scale.\
        Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
        ")
    stopwordsList = stopwords.words('english')

    for sentence in sentences:
        raw_tokens = word_tokenize(sentence)  # WordPunctTokenizer().tokenize(sentence)
        # removing stopwords and digit-only tokens
        final_tokens = [token.lower() for token in raw_tokens
                        if token not in stopwordsList
                        #and (len(token) > 3)
                        and not token.isdigit()]
        for token in final_tokens:
            synset = wsd.lesk(sentence, token)
            if synset is not None:
                SynSets.append(synset)
    return sorted(set(SynSets))

run1 = run()

for i in range(10):
    assert run1 == run()
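A caveat on both checks (my addition, not something either side verified here): they rerun lesk() inside a single interpreter, where the hash seed is fixed for the life of the process. Nondeterminism caused by Python 3's hash randomization only shows up across separate interpreter invocations, which would match "I get a difference when I restart the program" below. A sketch of a cross-process check:

import subprocess
import sys

# run the same one-liner in 5 fresh interpreters and collect their outputs
snippet = ("from nltk.wsd import lesk; "
           "print(lesk('Python is a programming language .'.split(), 'language'))")
outputs = set()
for _ in range(5):
    outputs.add(subprocess.check_output([sys.executable, "-c", snippet]))
print(outputs)  # more than one distinct output means runs disagree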

I have NLTK 3.0.1 installed, and I am using wsd.lesk from NLTK. The problem is the differing output. Did you have a similar problem? As for pywsd: thanks, I will try it. – MisterMe


Are you calling it correctly? When I run it 10 times, nothing is different. How did you call lesk from NLTK? – alvas


I get a difference when I restart the program. You can see how I use lesk in the example. – MisterMe
