2015-06-09 5 views
1

Loading scraped data into PostgreSQL

I combined a few web-scraping tutorials and built a simple web crawler that scrapes newly posted questions here on SO. I want to load them into my PostgreSQL database, but I am having trouble decoding the error my crawler shows.

Error:

2015-06-09 06:07:10+0200 [stack] ERROR: Error processing {'title': u'Laravel 5 Confused when implements ShoudlQueue', 
    'url': u'/questions/30722718/laravel-5-confused-when-implements-shoudlqueue'} 
    Traceback (most recent call last): 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/scrapy/middleware.py", line 62, in _process_chain 
     return process_chain(self.methods[methodname], obj, *args) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 65, in process_chain 
     d.callback(input) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 393, in callback 
     self._startRunCallbacks(result) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks 
     self._runCallbacks() 
    --- <exception caught here> --- 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks 
     current.result = callback(current.result, *args, **kw) 
     File "/home/petarp/Documents/PyScraping/RealPython/WebScraping/stack/stack/pipelines.py", line 27, in process_item 
     session.commit() 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 790, in commit 
     self.transaction.commit() 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 392, in commit 
     self._prepare_impl() 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 372, in _prepare_impl 
     self.session.flush() 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2004, in flush 
     self._flush(objects) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2122, in _flush 
     transaction.rollback(_capture_exception=True) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__ 
     compat.reraise(exc_type, exc_value, exc_tb) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2086, in _flush 
     flush_context.execute() 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute 
     rec.execute(self) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute 
     uow 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 174, in save_obj 
     mapper, table, insert) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 761, in _emit_insert_statements 
     execute(statement, params) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute 
     return meth(self, multiparams, params) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection 
     return connection._execute_clauseelement(self, multiparams, params) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement 
     compiled_sql, distilled_params 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context 
     context) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception 
     exc_info 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause 
     reraise(type(exception), exception, tb=exc_tb) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context 
     context) 
     File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute 
     cursor.execute(statement, parameters) 
    sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) column "url" of relation "reals" does not exist 
    LINE 1: INSERT INTO reals (title, url) VALUES ('Laravel 5 Confused w... 
            ^
    [SQL: 'INSERT INTO reals (title, url) VALUES (%(title)s, %(url)s) RETURNING reals.id'] [parameters: {'url': u'/questions/30722718/laravel-5-confused-when-implements-shoudlqueue', 'title': u'Laravel 5 Confused when implements ShoudlQueue'}] 

I used SQLAlchemy to define the connection between the crawler and PostgreSQL. Below are settings.py, models.py and pipelines.py.

Settings.py:

BOT_NAME = 'stack' 

SPIDER_MODULES = ['stack.spiders'] 
NEWSPIDER_MODULE = 'stack.spiders' 
ITEM_PIPELINES = ['stack.pipelines.StackPipeline'] 
# Crawl responsibly by identifying yourself (and your website) on the user-agent 
# USER_AGENT = 'stack (+http://www.yourdomain.com)' 
DATABASE = { 
    'drivername': 'postgres', 
    'host': 'localhost', 
    'port': '5432', 
    'username': 'petarp', 
    'password': 'gnomeregan', 
    'database': 'realpython' 
} 

Models.py:

from sqlalchemy import create_engine, Column, Integer, String 
from sqlalchemy.ext.declarative import declarative_base 
from sqlalchemy.engine.url import URL 

import settings 


DeclarativeBase = declarative_base() 


def db_connect(): 
    """ Performs database connections using database settings from settings.py 
     Returns sqlalchemy engine instance 
    """ 
    return create_engine(URL(**settings.DATABASE)) 


def create_reals_table(engine): 
    """Create the reals table if it does not exist yet.""" 
    DeclarativeBase.metadata.create_all(engine) 


class Reals(DeclarativeBase): 
    """SQLAlchemy Reals Model""" 
    __tablename__ = 'reals' 

    id = Column(Integer, primary_key=True) 
    title = Column('title', String) 
    url = Column('url', String, nullable=True) 

Pipelines.py:

from sqlalchemy.orm import sessionmaker 
from models import Reals, db_connect, create_reals_table 


class StackPipeline(object): 
    """ Stack Exchange pipeline for storing scraped items in the database """ 
    def __init__(self):
        """Initialize database connection and sessionmaker."""
        engine = db_connect()
        create_reals_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """Save reals in database.

        This method is called for every item pipeline component.
        """
        session = self.Session()
        real = Reals(**item)

        try:
            session.add(real)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
        return item

Schema of the reals table:

realpython=# select * from reals limit 5; 
id | title | link 
----+-------+------ 
(0 rows) 

Can anyone help me understand what is going on here and decode this?

+0

Could you check that your 'reals' table actually contains all the required columns? – plaes

+1

Please provide the schema of 'reals'; it looks like it has no url column. – The6thSense

+0

Yes, you are right, it has no url column. – PetarP
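
One way to act on the comments above is to compare the columns the table really has with the fields the crawler sends. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a server; the helper names table_columns and missing_columns are hypothetical, and the table definition simply mirrors the reals schema shown in the question. Against PostgreSQL, the same information is available from information_schema.columns.

```python
import sqlite3


def table_columns(conn, table):
    """Return the set of column names of `table` (SQLite-specific PRAGMA)."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    # The table name is interpolated, so only pass trusted names here.
    rows = conn.execute("PRAGMA table_info({})".format(table)).fetchall()
    return {row[1] for row in rows}


def missing_columns(conn, table, item_fields):
    """Return the item fields that have no matching column in `table`."""
    return sorted(set(item_fields) - table_columns(conn, table))


conn = sqlite3.connect(":memory:")
# Stand-in for the existing table: note the column is "link", not "url".
conn.execute("CREATE TABLE reals (id INTEGER PRIMARY KEY, title TEXT, link TEXT)")

# Fields the crawler tries to insert versus fields that match the table.
print(missing_columns(conn, "reals", ["title", "url"]))
print(missing_columns(conn, "reals", ["title", "link"]))
```

An empty result means every item field has a column to land in; anything listed would trigger the same kind of "column does not exist" error on insert.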

Answers

1

The error message is actually self-explanatory; you just need to look at the last few lines:

sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) column "url" of relation "reals" does not exist

So you either need to change the SQL so that it inserts into the column named link, or rename the column in the table with ALTER TABLE reals RENAME COLUMN link TO url;.
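
Both options from this answer can be tried out without touching the real database. The following is a minimal sketch that uses the standard-library sqlite3 module in place of PostgreSQL (an assumption made so the example is self-contained; the exception class and message differ under psycopg2). It recreates the reals table with a link column, shows the insert into url failing, and then applies each fix in turn. RENAME COLUMN needs SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same shape as the existing table in the question: the column is "link".
conn.execute("CREATE TABLE reals (id INTEGER PRIMARY KEY, title TEXT, link TEXT)")

row = {
    "title": "Laravel 5 Confused when implements ShoudlQueue",
    "url": "/questions/30722718/laravel-5-confused-when-implements-shoudlqueue",
}

# The insert names a column the table does not have, so it fails.
try:
    conn.execute("INSERT INTO reals (title, url) VALUES (:title, :url)", row)
    insert_failed = False
except sqlite3.OperationalError as exc:
    insert_failed = True
    print("insert failed:", exc)

# Option 1: insert into the column that actually exists.
conn.execute("INSERT INTO reals (title, link) VALUES (:title, :url)", row)

# Option 2: rename the column so that the original insert works.
conn.execute("ALTER TABLE reals RENAME COLUMN link TO url")
conn.execute("INSERT INTO reals (title, url) VALUES (:title, :url)", row)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM reals").fetchone()[0]
print("rows stored:", count)
```

In a real project only one of the two options would be applied; they are shown together here purely to confirm that each one resolves the mismatch.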

1

I found the solution.

The problem was the mismatch between url and link. The table schema I created from my models uses link, so I simply replaced url with link and the data loaded into PostgreSQL successfully. This is how I had defined the fields in my Items.py:

from scrapy import Item, Field 


class StackItem(Item): 
    # define the fields for your item here like: 
    # name = scrapy.Field() 
    title = Field() 
    url = Field() 

New Items.py:

from scrapy import Item, Field 


class StackItem(Item): 
    # define the fields for your item here like: 
    # name = scrapy.Field() 
    title = Field() 
    link = Field() 
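
Renaming the item field is one way to line the names up. An alternative, sketched below under the assumption that you would rather keep url in the item, is to translate field names just before the row is built. The helper rename_fields and the FIELD_TO_COLUMN mapping are hypothetical names, and a plain dict stands in for the Scrapy item.

```python
# Hypothetical mapping from item field names to table column names.
FIELD_TO_COLUMN = {"url": "link"}


def rename_fields(item, mapping=FIELD_TO_COLUMN):
    """Return a new dict whose keys follow the table's column names."""
    # Keys without an entry in the mapping are passed through unchanged.
    return {mapping.get(key, key): value for key, value in dict(item).items()}


item = {
    "title": "pointcut execution for specific class constructor",
    "url": "/questions/30723494/pointcut-execution-for-specific-class-constructor",
}
row = rename_fields(item)
print(sorted(row))
```

In the pipeline the result could then be passed to the model, for example Reals(**rename_fields(item)), provided the model itself declares a link column.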

Desired output:

 id |                                 title                                  |                                          link
----+------------------------------------------------------------------------+----------------------------------------------------------------------------------------
  1 | pointcut execution for specific class constructor                      | /questions/30723494/pointcut-execution-for-specific-class-constructor
  2 | PWX-00001 Error opening repository “dtlmsg.txt”. RCs = 268/150/2       | /questions/30723493/pwx-00001-error-opening-repository-dtlmsg-txt-rcs-268-150-2
  3 | Can anyone share a sample c++ program, that reads ASCII stl type file? | /questions/30723491/can-anyone-share-a-sample-c-program-that-reads-ascii-stl-type-file
  4 | Where should I do the core logic code in express js?                   | /questions/30723487/where-should-i-do-the-core-logic-code-in-express-js
  5 | configuring rails application to make ui router work                   | /questions/30723485/configuring-rails-application-to-make-ui-router-work
(5 rows) 

+0

And yes, I need to change the url column of the schema to be link. – PetarP
