2016-08-15 5 views
0

Я следующий pyspark код, который бросает ошибкуPySpark бросали ошибки при выполнении задания MapReduce

data = sc.textFile("file:///zika-map/cdc_zika/update_clean_zika.csv") 
header = data.first() 
byCountryNoHeader = data.filter(lambda x: x!=header) 
sepColumn = byCountryNoHeader.map(lambda x: x.split(",")) 
byCountry =sepColumn.map(lambda x: (x[1], x[5])).reduceByKey(lambda x,y: int(x)+int(y)) 
byCountry.collect() 

Update_clean_zika.csv имеет данные, такие как, как показано ниже:

report date country city location type data field value unit 
19/03/2016 Argentina Buenos Aires province cumulative confirmed local cases 0 cases 
19/03/2016 Argentina Buenos Aires province cumulative probable local cases 0 cases 
19/03/2016 Argentina Buenos Aires province cumulative confirmed imported cases 2 cases 
19/03/2016 Argentina Buenos Aires province cumulative probable imported cases 1 cases 
19/03/2016 Argentina Buenos Aires province cumulative cases under study 127 cases 
19/03/2016 Argentina Buenos Aires province cumulative cases discarded 0 cases 
19/03/2016 Argentina CABA province cumulative confirmed local cases 0 cases 
19/03/2016 Argentina CABA province cumulative probable local cases 0 cases 
19/03/2016 Argentina CABA province cumulative confirmed imported cases 9 cases 
19/03/2016 Argentina CABA province cumulative probable imported cases 0 cases 
19/03/2016 Argentina CABA province cumulative cases under study 68 cases 

В общем, что я пытаюсь делать, сопоставлять страны с делами, а затем придумывать общие случаи по стране. Mapping работает отлично, но reduceByKey вызывает ошибку, как показано ниже:

Traceback (most recent call last): 

    File "<ipython-input-19-db6ad3fdabe0>", line 16, in <module> 
    byCountry.groupByKey().collect() 

    File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 771, in collect 
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) 

    File "C:\Spark\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", line 813, in __call__ 
    answer, self.gateway_client, self.target_id, self.name) 

    File "C:\Spark\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value 
    format(target_id, ".", name), value) 

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 46.0 failed 1 times, most recent failure: Lost task 0.0 in stage 46.0 (TID 63, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 2346, in pipeline_func 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 2346, in pipeline_func 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 317, in func 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 1776, in combineLocally 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\shuffle.py", line 238, in mergeValues 
    d[k] = comb(d[k], v) if k in d else creator(v) 
    File "<ipython-input-19-db6ad3fdabe0>", line 7, in <lambda> 
ValueError: invalid literal for int() with base 10: 'zika confirmed laboratory' 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) 
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
    at java.lang.Thread.run(Unknown Source) 

Driver stacktrace: 
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) 
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
    at scala.Option.foreach(Option.scala:236) 
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) 
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929) 
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) 
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926) 
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405) 
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) 
    at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) 
    at java.lang.reflect.Method.invoke(Unknown Source) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) 
    at py4j.Gateway.invoke(Gateway.java:259) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:209) 
    at java.lang.Thread.run(Unknown Source) 
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 2346, in pipeline_func 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 2346, in pipeline_func 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 317, in func 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 1776, in combineLocally 
    File "C:\Spark\python\lib\pyspark.zip\pyspark\shuffle.py", line 238, in mergeValues 
    d[k] = comb(d[k], v) if k in d else creator(v) 
    File "<ipython-input-19-db6ad3fdabe0>", line 7, in <lambda> 
ValueError: invalid literal for int() with base 10: 'zika confirmed laboratory' 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) 
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
1 more 

Я пробовал различные способы и различные темы из Stackoverflow, но не повезло. Любая помощь или предложение будут оценены.

+0

Почему вы не используете DataFrames? –

+0

Я сделал это с zeppelin, но опять подобный вид ошибки. – Emdadul

+0

Я бы рекомендовал вам использовать DataFrames вместе с разъемом Databricks Spark CSV. –

ответ

0

Я понял, в основном, было несколько вопросов с нулевым значением. Таким образом, созданный dataframe, а затем использование Spark SQL было в состоянии игнорировать их.

0

У вас есть Value Error, который происходит в лямбда-функции, когда вы делаете: int(x) + int(y). stderr показывает: ValueError: invalid literal for int() with base 10: 'zika confirmed laboratory', что означает, что некоторые значения x[5] не могут быть преобразованы в int, т. е. «zika limited lab» не может быть преобразована в int. Вероятно, вам просто нужно исправить индексирование.

+0

Я проверил значения в столбце x [5], и нет ни одного " zika подтвердила лабораторию "в этой колонке – Emdadul

+0

Для части csv, которую вы указали выше, если вы хотите добавить столбцы значений, то не будет ли это' x [4] 'для пятого столбца? –

+0

Нет, поскольку столбцы являются report_date, country, city, location_type, data_field, value, unit – Emdadul

Смежные вопросы