Twitter JSON data in Hadoop

I have set up Twitter data streaming into HDFS. This is my Twitter-agent configuration:

#setting properties of agent
Twitter-agent.sources=source1
Twitter-agent.channels=channel1
Twitter-agent.sinks=sink1

#configuring sources
Twitter-agent.sources.source1.type=com.cloudera.flume.source.TwitterSource
Twitter-agent.sources.source1.channels=channel1
Twitter-agent.sources.source1.consumerKey=<consumer-key>
Twitter-agent.sources.source1.consumerSecret=<consumer-secret>
Twitter-agent.sources.source1.accessToken=<access-token>
Twitter-agent.sources.source1.accessTokenSecret=<Access-Token-secret>
Twitter-agent.sources.source1.keywords= morning, night, hadoop, bigdata

#configuring channels
Twitter-agent.channels.channel1.type=memory
Twitter-agent.channels.channel1.capacity=10000
Twitter-agent.channels.channel1.transactionCapacity=100

#configuring sinks
Twitter-agent.sinks.sink1.channel=channel1
Twitter-agent.sinks.sink1.type=hdfs
Twitter-agent.sinks.sink1.hdfs.path=flume/tweets
Twitter-agent.sinks.sink1.rollSize=0
Twitter-agent.sinks.sink1.rollCount=10000
Twitter-agent.sinks.sink1.batchSize=1000
Twitter-agent.sinks.sink1.fileType=DataStream
Twitter-agent.sinks.sink1.writeFormat=Text

The Twitter data streams successfully, but every FlumeData file in HDFS looks like this:

SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable� \t ���^�kd��h?�tN ���h{"in_reply_to_status_id_str":null,"in_reply_to_status_id":null,"created_at":"Tue Jun 23 15:09:32 +0000 2015","in_reply_to_user_id_str":null,"source":"<a href=\"http://tweetlogix.com\" rel=\"nofollow\">Tweetlogix<\/a>","retweet_count":0,"retweeted":false,"geo":null,"filter_level":"low","in_reply_to_screen_name":null,"id_str":"613363262709723139","in_reply_to_user_id":null,"favorite_count":0,"id":613363262709723139,"text":"Morning.","place":null,"lang":"en","favorited":false,"possibly_sensitive":false,"coordinates":null,"truncated":false,"timestamp_ms":"1435072172225","entities":{"urls":[],"hashtags":[],"user_mentions":[],"trends":[],"symbols":[]},"contributors":null,"user":{"utc_offset":-14400,"friends_count":195,"profile_image_url_https":"https://pbs.twimg.com/profile_images/613121771093532673/mA5NPv6X_normal.jpg","listed_count":16,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000045222063/847094549362b20f2b1e3c1ff137a80f.png","default_profile_image":false,"favourites_count":891,"description":"See, I was actually on my way to get a piece of burger from Burger King.....","created_at":"Sat Apr 30 00:51:06 +0000 2011","is_translator":false,"profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000045222063/847094549362b20f2b1e3c1ff137a80f.png","protected":false,"screen_name":"NilesDontCurrr","id_str":"290266873","profile_link_color":"FF0000","id":290266873,"geo_enabled":false,"profile_background_color":"FFFFFF","lang":"en","profile_sidebar_border_color":"FFFFFF","profile_text_color":"34AA7A","verified":false,"profile_image_url":"http://pbs.twimg.com/profile_images/613121771093532673/mA5NPv6X_normal.jpg","time_zone":"Eastern Time (US & 
Canada)","url":null,"contributors_enabled":false,"profile_background_tile":true,"profile_banner_url":"https://pbs.twimg.com/profile_banners/290266873/1432844093","statuses_count":68154,"follow_request_sent":null,"followers_count":4611,"profile_use_background_image":true,"default_profile":false,"following":null,"name":"niles.","location":"New York City.","profile_sidebar_fill_color":"AFDFB7","notifications":null}}

When I parse this JSON data in Hive, I get an error like this:

Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('S' (code 83)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: [email protected]; line: 1, column: 2]

I think the error is caused by this line, which is the first line in every FlumeData file: SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable� ��^�kd��h?�tN ��h Am I right?

Shouldn't the Twitter JSON data start like this: {"in_reply_to_status_id_str":......}?

Source

2015-06-26 MChirukuri

It is not only the start of the JSON. Please check the rest of the data and why this error occurs: "Unexpected character ('S' (code 83))" – Ramzy

I have included one complete FlumeData file, please take a look. I think the error is caused by the first line, which starts with "SEQ" – MChirukuri

This is not a JSON file; it is a sequence file with the JSON inside it, encoded as a byte array. –
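One way to confirm this on a file copied out of HDFS is to look at its first bytes: a Hadoop SequenceFile starts with the three bytes `SEQ`, while plain JSON text starts with `{` or `[`. The Python sketch below is only an illustration; the function names are hypothetical and not part of Flume, Hadoop, or Hive.

```python
# Hadoop SequenceFile files begin with these three magic bytes.
SEQ_MAGIC = b"SEQ"


def looks_like_sequence_file(head: bytes) -> bool:
    """Return True if the leading bytes carry the SequenceFile magic."""
    return head.startswith(SEQ_MAGIC)


def looks_like_json_text(head: bytes) -> bool:
    """Return True if the first non-whitespace byte opens a JSON object or array."""
    stripped = head.lstrip()
    return stripped[:1] in (b"{", b"[")
```

To use it, read the first few dozen bytes of a local copy of a FlumeData file in binary mode (`open(path, "rb").read(64)`) and pass them to both functions. A file like the one in the question would be reported as a sequence file rather than JSON text.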

Flume is generating the files in binary format instead of text format. This happens because some properties in your configuration file are not set correctly, including the two properties below.

Twitter-agent.sinks.sink1.fileType=DataStream 
Twitter-agent.sinks.sink1.writeFormat=Text

The correct way to set these properties is shown below.

Twitter-agent.sinks.sink1.hdfs.fileType=DataStream 
Twitter-agent.sinks.sink1.hdfs.writeFormat=Text
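The same missing `hdfs.` prefix also affects `rollSize`, `rollCount`, and `batchSize` in the question's configuration. As a rough aid, the Python sketch below scans a Flume properties file for sink settings written without the prefix. The key list is an assumed, non-exhaustive subset of the HDFS sink settings, the function name is hypothetical, and the check assumes every sink in the file is an HDFS sink (other sink types may legitimately use some of these names unprefixed).

```python
import re

# Assumed, non-exhaustive subset of HDFS sink settings that take the "hdfs." prefix.
HDFS_PREFIXED_KEYS = {
    "path", "filePrefix", "fileSuffix", "rollInterval", "rollSize",
    "rollCount", "batchSize", "fileType", "writeFormat",
}

# Matches "<agent>.sinks.<sink>.<key>=" where <key> contains no further dots,
# so correctly prefixed lines such as "...sink1.hdfs.path=" do not match.
SINK_LINE = re.compile(r"^\s*([\w-]+)\.sinks\.([\w-]+)\.(\w+)\s*=")


def find_unprefixed_hdfs_keys(config_text: str) -> list:
    """Return the full property names that look like HDFS settings without 'hdfs.'."""
    flagged = []
    for line in config_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = SINK_LINE.match(line)
        if match and match.group(3) in HDFS_PREFIXED_KEYS:
            agent, sink, key = match.groups()
            flagged.append(f"{agent}.sinks.{sink}.{key}")
    return flagged
```

Run against the sink section from the question, this flags `rollSize`, `rollCount`, `batchSize`, `fileType`, and `writeFormat`, and leaves `hdfs.path`, `type`, and `channel` alone.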

Source

2015-06-27 18:30:46 Shubhangi

That worked. Thanks – MChirukuri
