2016-10-23 2 views
2

Regular expression behaves unexpectedly when certain specific words are used

I am trying to remove stop words from strings.

I ran into unexpected results with some combinations of words. Below is the smallest example I could put together that demonstrates the behavior.

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import re 
import json 

en = ''' 
["different","doesn't","doing","don't","done","down","downwards","during","e","each","edu","eg","eight","either","else","elsewhere","enough","entirely","especially","et","etc","even","ever","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","f","far","few","fifth","first","five","followed","following","follows","for","former","formerly","forth","four","from","further","furthermore","g","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","h","had","hadn't","known","knows","l","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","little","look","looking","looks","ltd","m","mainly","many","may","maybe","me","mean","meanwhile","merely","might","more","moreover","most","mostly","much","must","my","myself","n","name","namely","nd","near"] 
''' 

fr = ''' 
["a","abord","absolument","afin","ah","ai","aie","ailleurs","ainsi","ait","allaient","allo","allons","allô","alors","anterieur","anterieure","anterieures","apres","après","as","assez","attendu","au","aucun","aucune","aujourd","aujourd'hui","aupres","auquel","aura","auraient","aurait","auront","aussi","autre","autrefois","autrement","autres","autrui","aux","auxquelles","auxquels","avaient","avais","avait","avant","avec","avoir","avons","ayant","b","bah","bas","basee","bat","beau","beaucoup","bien","bigre","boum","bravo","brrr","c","car","ce","ceci","cela","celle","celle-ci","celle-là","celles","celles-ci","celles-là","celui","celui-ci","celui-là","cent","cependant","certain","certaine","certaines","certains","certes","ces","cet","cette","ceux","ceux-ci","ceux-là","chacun","chacune","chaque","cher","chers","chez","chiche","chut","chère","chères","ci","cinq","cinquantaine","cinquante","cinquantième","cinquième","clac","clic","combien","comme","comment","comparable","comparables","compris","concernant","contre","couic","crac","d","da","dans","de","debout","dedans","dehors","deja","delà","depuis","dernier","derniere","derriere","derrière","des","desormais","desquelles","desquels","dessous","dessus","deux","deuxième","deuxièmement","devant","devers","devra","different","differentes","differents","différent","différente","différentes","différents","dire","directe","directement","dit","dite","dits","divers","diverse","diverses","dix","dix-huit","dix-neuf","dix-sept","dixième","doit","doivent","donc","dont","douze","douzième","dring","du","duquel","durant","dès","désormais","e","effet","egale","egalement","egales","eh","elle","elle-même","elles","elles-mêmes","en","encore","enfin","entre","envers","environ","es","est","et","etant","etc","etre","eu","euh","eux","eux-mêmes","exactement","excepté","extenso","exterieur","f","fais","faisaient","faisant","fait","façon","feront","fi","flac","floc","font","g","gens","h","ha","hein","hem","hep","hi","ho","holà","hop","hormis","hors",
"hou","houp","hue","hui","huit","huitième","hum","hurrah","hé","hélas","i","il","ils","importe","j","je","jusqu","jusque","juste","k","l","la","laisser","laquelle","le","lequel","les","lesquelles","lesquels","leur","leurs","longtemps","lors","lorsque","lui","lui-meme","lui-même","là","lès","m","ma","maint","maintenant","oust","ouste","outre","ouvert","ouverte","ouverts","o|","où","p","paf","pan","par","parce","parfois","parle","parlent","parler","parmi","parseme","partant","particulier","particulière","probante","procedant","proche","près","psitt","pu","puis","puisque","pur","pure","q","qu","quand","quant","quant-à-soi","quanta","quarante","quatorze","quatre","quatre-vingt","quatrième","quatrièmement","que","quel","quelconque","quelle","quelles","quelqu'un","quelque","quelques","quels","qui","quiconque","quinze","quoi","quoique","r","rare","rarement","rares","relative","relativement","remarquable","rend","rendre","restant","reste","restent","restrictif","retour","revoici","revoilà","rien","sa","sacrebleu","sait","sans","sapristi","sauf","se","sein","seize","selon","semblable","tres","trois","troisième","troisièmement","trop","vrai"] 
''' 

stopwords = set(json.loads(en) + json.loads(fr)) 

stopwordsStr = '|'.join(stopwords) 
regex = re.compile(r'\b('+stopwordsStr+r')\b') 

msg = "le vrai commentaire sur les vous tres time foobar" 
print msg 
print regex.sub('', msg) 

The code above works as expected:

$ python debug.py 
le vrai commentaire sur les vous tres time foobar 
    commentaire sur vous time foobar 

The stop words are removed correctly.

Now for the interesting part. If I change the lines defining the English words to this:

en = ''' 
["time","different","doesn't","doing","don't","done","down","downwards","during","e","each","edu","eg","eight","either","else","elsewhere","enough","entirely","especially","et","etc","even","ever","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","f","far","few","fifth","first","five","followed","following","follows","for","former","formerly","forth","four","from","further","furthermore","g","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","h","had","hadn't","known","knows","l","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","little","look","looking","looks","ltd","m","mainly","many","may","maybe","me","mean","meanwhile","merely","might","more","moreover","most","mostly","much","must","my","myself","n","name","namely","nd","near"] 
''' 

I just added the word "time" at the beginning. I could have added it anywhere in the list and it would break in the same way.

Now I get:

$ python ../converse/debug.py 
le vrai commentaire sur les vous tres time foobar 
le vrai commentaire sur vous tres time foobar 

Now some of the stop words are no longer removed. I really don't understand what is going on.

If I remove a few words from the stop word list, it works correctly again, for example if I remove one of the negative contractions (such as "doesn't") from the English list.

+0

Have you tried escaping the words? http://stackoverflow.com/questions/280435/escaping-regex-string-in-python Your problem sounds strange... And for @MosesKoledoye: r marks a 'raw string' - http://stackoverflow.com/questions/2081640/what-exact-do-u-and-r-string-flags-do-in-python-and-what-are-raw-string-l

+0

@YotamSalmon I already know that it is wrong.

+0

@YotamSalmon adding 'stopwords = [re.escape(stopword) for stopword in stopwords]' fixes it, thanks... but at this point I am more interested in why it behaves this way than in the fix – MasterScrat

Answer

0

The fr list contains the word "o|", which produces '||' in the final regular expression. The parser accepts this without complaint, because an empty alternative is valid syntax. Changing "o|" to "o" solves the problem.

Alternatively, the words can be escaped with re.escape. Then a mistake in a single word would not break the whole regular expression.
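A minimal sketch of the re.escape approach, again with a hypothetical three-word list standing in for the real ones. Escaping turns the '|' inside "o|" into a literal character, so it no longer splits the alternation and every word after it stays reachable.

```python
import re

# Hypothetical short list; "o|" mimics the stray entry in the fr list.
stopwords = ["le", "o|", "les"]

# Escape each word so regex metacharacters are matched literally.
escaped = [re.escape(word) for word in stopwords]
regex = re.compile(r'\b(' + '|'.join(escaped) + r')\b')

msg = "le vrai les"
cleaned = regex.sub('', msg)
print(repr(cleaned))  # ' vrai '
```

The stray "o|" entry is still in the list and simply never matches ordinary text, so cleaning the list itself remains worthwhile.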
