2015-12-14 2 views

Parsing a line with a single double quote in Graphlab.SFrame

I have rows like the following from this dataset (https://raw.githubusercontent.com/alvations/stasis/master/sts.csv):

Dataset Domain Score Sent1 Sent2 
STS2012-gold surprise.OnWN 5.000 render one language in another language restate (words) from one language into another language. 
STS2012-gold surprise.OnWN 3.250 nations unified by shared interests, history or institutions a group of nations having common interests. 
STS2012-gold surprise.OnWN 3.250 convert into absorbable substances, (as if) with heat or chemical process soften or disintegrate by means of chemical action, heat, or moisture. 
STS2012-gold surprise.OnWN 4.000 devote or adapt exclusively to an skill, study, or work devote oneself to a special area of work. 
STS2012-gold surprise.OnWN 3.250 elevated wooden porch of a house a porch that resembles the deck on a ship. 

I read it into a graphlab.SFrame using the read_csv() function:

import graphlab 
sts = graphlab.SFrame.read_csv('sts.csv', delimiter='\t', column_type_hints=[str, str, float, str, str]) 

Some lines could not be parsed. The log output looks like this:

PROGRESS: Unable to parse line "STS2012-gold MSRpar 3.800 "She was crying and scared,' said Isa Yasin, the owner of the store. "She was crying and she was really scared," said Yasin." 
PROGRESS: Unable to parse line "STS2012-gold MSRpar 2.200 "And about eight to 10 seconds down, I hit. "I was in the water for about eight seconds." 
PROGRESS: Unable to parse line "STS2012-gold MSRpar 2.800 "It's a major victory for Maine, and it's a major victory for other states. The Maine program could be a model for other states." 
PROGRESS: Unable to parse line "STS2012-gold MSRpar 4.000 "Right from the beginning, we didn't want to see anyone take a cut in pay. But Mr. Crosby told The Associated Press: "Right from the beginning, we didn't want to see anyone take a cut in pay." 
PROGRESS: Unable to parse line "STS2014-gold deft-forum 0.8 "Then the captain was gone. Then the captain came back." 
PROGRESS: Unable to parse line "STS2014-gold deft-forum 1.8 "Oh, you're such a good person! You're such a bad person!"" 
PROGRESS: Unable to parse line "STS2012-train MSRpar 3.750 "We put a lot of effort and energy into improving our patching process, probably later than we should have and now we're just gaining incredible speed. "We've put a lot of effort and energy into improving our patching progress, p..." 
PROGRESS: Unable to parse line "STS2012-train MSRpar 4.000 "Tomorrow at the Mission Inn, I have the opportunity to congratulate the governor-elect of the great state of California. "I have the opportunity to congratulate the governor-elect of the great state of California, and I'm lookin..." 
PROGRESS: Unable to parse line "STS2012-train MSRpar 3.600 "Unlike many early-stage Internet firms, Google is believed to be profitable. The privately held Google is believed to be profitable." 
PROGRESS: Unable to parse line "STS2012-train MSRpar 4.000 "It was a final test before delivering the missile to the armed forces. State radio said it was the last test before the missile was delivered to the armed forces." 
PROGRESS: 22 lines failed to parse correctly 
PROGRESS: Finished parsing file /home/alvas/git/stasis/sts.csv 
PROGRESS: Parsing completed. Parsed 19075 lines in 0.069578 secs. 

Looking at these lines, the problem seems to occur whenever one of my Sent1 or Sent2 columns contains an odd number of double quotes.
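
One way to test this hypothesis without GraphLab is to count the double quotes in each tab-separated field. The following is a minimal sketch: the helper name `fields_with_unbalanced_quotes` is hypothetical, and the sample rows are shortened versions of lines from the dataset, with the tabs written out explicitly.

```python
def fields_with_unbalanced_quotes(lines, delimiter='\t', quote='"'):
    """Return (line_number, [field_indexes]) for fields holding an odd number of quotes."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip('\n').split(delimiter)
        # A field with an odd quote count has an opening or closing quote without a partner.
        odd = [i for i, field in enumerate(fields) if field.count(quote) % 2 == 1]
        if odd:
            suspects.append((number, odd))
    return suspects


# Shortened sample rows (assumption: the real file has the same five-column layout).
sample = [
    'STS2012-gold\tsurprise.OnWN\t4.750\trestrict or confine\tplace limits on (extent or access).\n',
    'STS2014-gold\tdeft-forum\t0.8\t"Then the captain was gone.\tThen the captain came back.\n',
    'STS2014-gold\tdeft-forum\t1.8\t"Oh, you\'re such a good person!\tYou\'re such a bad person!"\n',
]

suspects = fields_with_unbalanced_quotes(sample)
for number, odd in suspects:
    print(number, odd)
```

Counting per field rather than per line matters for the third row: the line as a whole has two quotes, but one opens in Sent1 and the other closes in Sent2, so a quote-aware parser would still read across the tab.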

Using error_bad_lines to track down the problematic lines:

sts = graphlab.SFrame.read_csv('sts.csv', delimiter='\t', column_type_hints=[str, str, float, str, str], 
           error_bad_lines=True) 

It throws this traceback:

--------------------------------------------------------------------------- 
RuntimeError        Traceback (most recent call last) 
<ipython-input-15-a1ec53597af9> in <module>() 
     1 sts = graphlab.SFrame.read_csv('sts.csv', delimiter='\t', column_type_hints=[str, str, float, str, str], 
----> 2        error_bad_lines=True) 

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in read_csv(cls, url, delimiter, header, error_bad_lines, comment_char, escape_char, double_quote, quote_char, skip_initial_space, column_type_hints, na_values, line_terminator, usecols, nrows, skiprows, verbose, **kwargs) 
    1537         verbose=verbose, 
    1538         store_errors=False, 
-> 1539         **kwargs)[0] 
    1540 
    1541 

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in _read_csv_impl(cls, url, delimiter, header, error_bad_lines, comment_char, escape_char, double_quote, quote_char, skip_initial_space, column_type_hints, na_values, line_terminator, usecols, nrows, skiprows, verbose, store_errors, **kwargs) 
    1097     glconnect.get_client().set_log_progress(False) 
    1098    with cython_context(): 
-> 1099     errors = proxy.load_from_csvs(internal_url, parsing_config, type_hints) 
    1100   except Exception as e: 
    1101    if type(e) == RuntimeError and "CSV parsing cancelled" in e.message: 

/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback) 
    47    if not self.show_cython_trace: 
    48     # To hide cython trace, we re-raise from here 
---> 49     raise exc_type(exc_value) 
    50    else: 
    51     # To show the full trace, we do nothing and let exception propagate 

RuntimeError: Runtime Exception. Unable to parse line "STS2012-gold MSRpar 3.800 "She was crying and scared,' said Isa Yasin, the owner of the store. "She was crying and she was really scared," said Yasin." 
Set error_bad_lines=False to skip bad lines 

Is there a way to solve this problem when my rows contain an odd number of double quotes?

Is there a way to do this without cleaning the data (for example, identifying the problematic lines and then cleaning/correcting them, while keeping another SFrame to track the cleaning/corrections)?


As a sanity check, searching for \t in the raw CSV file shows that there is a tab in the rows that cause the problem, but when graphlab parses them, the tab disappears.


As another sanity check, reading the file line by line and splitting each line on \t returns 5 columns for the whole file:

~/git/stasis$ head sts.csv
Dataset Domain Score Sent1 Sent2 
STS2012-gold surprise.OnWN 5.000 render one language in another language restate (words) from one language into another language. 
STS2012-gold surprise.OnWN 3.250 nations unified by shared interests, history or institutions a group of nations having common interests. 
STS2012-gold surprise.OnWN 3.250 convert into absorbable substances, (as if) with heat or chemical process soften or disintegrate by means of chemical action, heat, or moisture. 
STS2012-gold surprise.OnWN 4.000 devote or adapt exclusively to an skill, study, or work devote oneself to a special area of work. 
STS2012-gold surprise.OnWN 3.250 elevated wooden porch of a house a porch that resembles the deck on a ship. 
STS2012-gold surprise.OnWN 4.000 either half of an archery bow either of the two halves of a bow from handle to tip. 
STS2012-gold surprise.OnWN 3.333 a removable device that is an accessory to larger object a supplementary part or accessory. 
STS2012-gold surprise.OnWN 4.750 restrict or confine place limits on (extent or access). 
STS2012-gold surprise.OnWN 0.500 orient, be positioned be opposite. 
~/git/stasis$ python
Python 2.7.10 (default, Jun 30 2015, 15:30:23) 
[GCC 4.8.4] on linux2 
Type "help", "copyright", "credits" or "license" for more information. 
>>> with open('sts.csv') as fin: 
...  for line in fin: 
...    print len(line.split('\t')) 
...    break 
... 
5 

>>> with open('sts.csv') as fin: 
...  for line in fin: 
...    assert len(line.split('\t')) == 5 
... 
>>> 

One more sanity check, this time on the number of columns: @papayawarrior's example line with 4 columns was parsed correctly in my version of graphlab.


I manually checked the problematic lines, and they are:

STS2012-gold MSRpar 3.800 "She was crying and scared,' said Isa Yasin, the owner of the store. "She was crying and she was really scared," said Yasin. 
STS2012-gold MSRpar 2.200 "And about eight to 10 seconds down, I hit. "I was in the water for about eight seconds. 
STS2012-gold MSRpar 2.800 "It's a major victory for Maine, and it's a major victory for other states. The Maine program could be a model for other states. 
STS2012-gold MSRpar 4.000 "Right from the beginning, we didn't want to see anyone take a cut in pay. But Mr. Crosby told The Associated Press: "Right from the beginning, we didn't want to see anyone take a cut in pay. 
STS2012-train MSRpar 3.750 "We put a lot of effort and energy into improving our patching process, probably later than we should have and now we're just gaining incredible speed. "We've put a lot of effort and energy into improving our patching progress, probably later than we should have. 
STS2012-train MSRpar 4.000 "Tomorrow at the Mission Inn, I have the opportunity to congratulate the governor-elect of the great state of California. "I have the opportunity to congratulate the governor-elect of the great state of California, and I'm looking forward to it." 
STS2012-train MSRpar 3.600 "Unlike many early-stage Internet firms, Google is believed to be profitable. The privately held Google is believed to be profitable. 
STS2012-train MSRpar 4.000 "It was a final test before delivering the missile to the armed forces. State radio said it was the last test before the missile was delivered to the armed forces. 
STS2012-train MSRpar 4.750 "The economy, nonetheless, has yet to exhibit sustainable growth. But the economy hasn't shown signs of sustainable growth. 
STS2014-gold deft-forum 0.8 "Then the captain was gone. Then the captain came back. 
STS2014-gold deft-forum 1.8 "Oh, you're such a good person! You're such a bad person!" 
STS2015-gold answers-forums  "Normal, healthy (physically, nutritionally and mentally) individuals have little reason to worry about accidentally consuming too much water. It's fine to skip arm specific exercises if you are already happy with how they are progressing without direct exercises. 
STS2015-gold answers-forums 1.40 "The grass family is one of the most widely distributed and abundant groups of plants on Earth. As noted on the Wiki page, grass seed was imported to the new world to improve pasturage for livestock. 
STS2015-gold answers-forums  "God is exactly this Substance underlying who supports, exist independently of, and persist through time changes in material nature. I'd argue that matter and energy are substances in the category of empirical scientific knowledge. 
STS2015-gold belief  "watching the first fight i saw that manny pacquiao was getting tired, and i wasn't. at the same time, an asian summit is being held in a tourist resort. 
STS2015-gold belief  "global warming doesn't mean every year will be warmer than the last. doesn't matter, that will just be obama's fault as well. 
STS2015-gold belief  "the only reason i'm not as confident that there's something about the birth certificate... the conventional view is that the us and ussr fought it out in the body of vietnam. 
STS2015-gold belief  "im not playing these bullshit games... if not get the hell out of there. 
STS2015-gold belief  "that oil is already contaminating our shoreline. what point are you trying to relay? 
STS2015-gold belief  "we cannot write history with laws. "she's not sitting here" he said. 
STS2015-gold belief  the protest is going well so far. our request is the same. 
STS2015-gold belief  "for over 20 years, i have illustrated the absurd with absurdity, three hours a day, five days a week. for the first 1-2 years he hated me going out with my friends. 

Instead of finding these lines manually and re-cleaning them from the verbose PROGRESS: ... messages, is there a way to simply dump these lines when loading into a Graphlab SFrame?

I don't think the problem is the double quotes, because I don't see any quotes at all in your sample data. I suspect the real problem is that some of your rows have 4 columns and some have 5 (assuming you split on tabs, as you indicated). What happens when you remove the extra tabs in the last column? – papayawarrior

No, it's not extra tabs, I checked. It looks like it's the way Graphlab parses the columns. – alvas

Sorry @alvas, I didn't see that you had posted the full data. There are indeed five columns in every row, and there is some funky stuff going on with the quotes. I'm deleting my first answer, but I have a new one for you. – papayawarrior

Answer

UPDATED ANSWER

Apologies @alvas, I didn't see that the full dataset was linked in the original post. All rows do indeed have five columns, and the problem does appear to be mismatched quote characters. The SFrame CSV parser gets confused if a column contains a quote without a matching partner, so the short answer is to change the quote character to something you know does not appear in the dataset.

import graphlab 
sts = graphlab.SFrame.read_csv('sts.csv', delimiter='\t', 
           column_type_hints=[str, str, float, str, str], 
           quote_char='\0') 

This successfully reads all 19,097 rows for me.
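
To see why changing the quote character helps, the effect can be reproduced without GraphLab. The sketch below uses Python's standard csv module as a stand-in. That is an assumption: it is not the SFrame parser, but it reacts to an unmatched opening quote in a comparable way. The sample line is shortened from one of the failing rows, and `csv.QUOTE_NONE` plays the role that `quote_char='\0'` plays in the call above.

```python
import csv

# Shortened failing row: Sent1 opens with a double quote that is never closed.
line = ('STS2012-gold\tMSRpar\t2.800\t'
        '"It\'s a major victory for Maine.\t'
        'The Maine program could be a model.')

# Default dialect: '"' starts a quoted field, so the following tab is read
# as part of that field instead of as a column separator.
default_row = next(csv.reader([line], delimiter='\t'))

# Quoting disabled: '"' is an ordinary character and every tab separates columns.
unquoted_row = next(csv.reader([line], delimiter='\t', quoting=csv.QUOTE_NONE))

print(len(default_row))   # one column short
print(len(unquoted_row))  # all five columns
```

With quoting left on, Sent1 and Sent2 are merged into a single field, which matches the symptom in the question where the tab seems to disappear.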

As an aside, there is also the SFrame.read_csv_with_errors method, which reads the "good" rows into an SFrame and collects the "bad" rows in an SArray of unparsed lines. This lets you track the problematic rows programmatically.
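
The same good/bad split can be sketched without GraphLab. The example below is only an analogy built on Python's standard csv module, not the behaviour of read_csv_with_errors itself. The function name `partition_rows` and its `expected_columns` parameter are hypothetical, and a row is treated as "bad" simply when it does not parse into the expected number of columns.

```python
import csv


def partition_rows(lines, expected_columns=5, delimiter='\t'):
    """Split raw lines into parsed 'good' rows and unparsed 'bad' lines."""
    good, bad = [], []
    for raw in lines:
        text = raw.rstrip('\n')
        # Parse one line at a time so a stray quote cannot swallow the next line.
        row = next(csv.reader([text], delimiter=delimiter))
        if len(row) == expected_columns:
            good.append(row)
        else:
            bad.append(text)  # keep the original text for later inspection
    return good, bad


# Shortened sample rows (assumption: same layout as sts.csv).
lines = [
    'STS2012-gold\tsurprise.OnWN\t4.000\teither half of an archery bow\teither of the two halves of a bow.\n',
    'STS2014-gold\tdeft-forum\t0.8\t"Then the captain was gone.\tThen the captain came back.\n',
]

good, bad = partition_rows(lines)
print(len(good), len(bad))
```

Keeping the bad lines as raw text makes it possible to correct them separately and record what was changed, instead of silently dropping them.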

ORIGINAL ANSWER

Your data rows don't appear to contain quotes, so that's not the problem. The problem is that you have 5 columns in some data rows (and in the header), but only 4 columns in other data rows.

The first row has four columns:

STS2012-gold surprise.OnWN 5.000 render one language in another language restate (words) from one language into another language. 

while the second row has five:

STS2012-gold surprise.OnWN 3.250 nations unified by shared interests, history or institutions a group of nations having common interests. 

To work around this, I would call the SFrame CSV parser twice, once for the four-column data and once for the five-column data. Since the first row has four columns, that case is a bit simpler:

import graphlab 
sts4 = graphlab.SFrame.read_csv('sts.csv', delimiter='\t', header=True) 

For the five-column data, we have to skip the header and the first row, and then rename the columns:

sts5 = graphlab.SFrame.read_csv('sts.csv', delimiter='\t', 
           header=False, skiprows=2) 
sts5 = sts5.rename({'X1': 'Dataset', 'X2': 'Domain', 'X3': 'Score', 
        'X4': 'Sent1', 'X5': 'Sent2'}) 

Then sts4 looks like

+--------------+---------------+-------+-------------------------------+
|   Dataset    |     Domain    | Score |             Sent1             |
+--------------+---------------+-------+-------------------------------+
| STS2012-gold | surprise.OnWN |  5.0  | render one language in ano... |
| STS2012-gold | surprise.OnWN |  4.0  | devote or adapt exclusivel... |
+--------------+---------------+-------+-------------------------------+

And sts5 is

+--------------+---------------+-------+-------------------------------+
|   Dataset    |     Domain    | Score |             Sent1             |
+--------------+---------------+-------+-------------------------------+
| STS2012-gold | surprise.OnWN |  3.25 | nations unified by shared ... |
| STS2012-gold | surprise.OnWN |  3.25 | convert into absorbable su... |
| STS2012-gold | surprise.OnWN |  3.25 | elevated wooden porch of a... |
+--------------+---------------+-------+-------------------------------+
+-------------------------------+
|             Sent2             |
+-------------------------------+
| a group of nations having ... |
| soften or disintegrate by ... |
| a porch that resembles the... |
+-------------------------------+

It's strange that 'graphlab.read_csv' acts weird =( – alvas

Did you copy and paste the 'sts.csv' file? Try saving the file as is; there are no rows with 4 columns =( – alvas

BTW, what is '\0'? – alvas
