2016-08-13 2 views
0

Extracting text from a span using an XPath or CSS query

I need to get the text from the following span element without it being split into separate text nodes.

<span class="a-size-base review-text">I purchased this from Fry's Electronics. 
 
<br/> 
 
<br/> 
 
The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing. 
 
<br/> 
 
<br/> 
 
I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems. 
 
<br/> 
 
<br/> 
 
The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome. 
 
<br/> 
 
<br/> 
 
The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge. 
 
<br/> 
 
<br/> 
 
Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's). 
 
<br/> 
 
<br/> 
 
NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person. 
 
</span>

However, when I apply my XPath query

//*[contains(concat(" ", @class, " "), concat(" ", "review-text", " "))]/text()

I get this:

Text='I purchased this from Fry's Electronics.' 
 
Text='' 
 
Text='The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing.' 
 
Text='' 
 
Text='I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems.' 
 
Text='' 
 
Text='The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome.' 
 
Text='' 
 
Text='The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge.' 
 
Text='' 
 
Text='Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's).' 
 
Text='' 
 
Text='NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person.'

I would like to get a single block of text without breaks. I am using this XPath tester: http://www.freeformatter.com/xpath-tester.html

Answers

0

A handy feature of Scrapy selectors is that they chain, so you can start with a CSS selection and then apply XPath string functions such as string() or normalize-space().

Here is an example Scrapy 1.1 shell session:

~$ scrapy shell 
2016-08-16 12:20:57 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot) 
2016-08-16 12:20:57 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'} 
(...) 
In [1]: html = '''<span class="a-size-base review-text">I purchased this from Fry's Electronics. 
    ...: <br/> 
    ...: <br/> 
    ...: The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing. 
    ...: <br/> 
    ...: <br/> 
    ...: I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems. 
    ...: <br/> 
    ...: <br/> 
    ...: The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome. 
    ...: <br/> 
    ...: <br/> 
    ...: The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge. 
    ...: <br/> 
    ...: <br/> 
    ...: Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's). 
    ...: <br/> 
    ...: <br/> 
    ...: NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person. 
    ...: </span>''' 

In [2]: import scrapy 

In [3]: selector = scrapy.Selector(text=html) 

In [4]: selector.css('span.review-text').xpath('string()').extract_first() 
Out[4]: 'I purchased this from Fry\'s Electronics.\n\n\nThe picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I\'m very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing.\n\n\nI wasn\'t planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems.\n\n\nThe unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you\'re ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome.\n\n\nThe stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge.\n\n\nOverall I\'m very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry\'s).\n\n\nNOTE: If you see any strange distortion in the images it\'s likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person.\n' 

In [5]: print(selector.css('span.review-text').xpath('string()').extract_first()) 
I purchased this from Fry's Electronics. 


The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing. 


I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems. 


The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome. 


The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge. 


Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's). 


NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person. 


In [6]: print(selector.css('span.review-text').xpath('normalize-space()').extract_first()) 
I purchased this from Fry's Electronics. The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing. I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems. The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome. The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge. Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's). NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person. 
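
The difference between the two XPath functions can also be reproduced without Scrapy. The following is a minimal sketch using only Python's standard library; the shortened markup and the variable names are made up for illustration, and itertext() plus str.split() only approximate string() and normalize-space() (str.split() treats more characters as whitespace than XPath does).

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the review markup in the question.
html = (
    '<span class="a-size-base review-text">First paragraph.\n'
    '<br/>\n<br/>\n'
    'Second paragraph.\n'
    '</span>'
)

span = ET.fromstring(html)

# Roughly what XPath string() yields: all descendant text concatenated.
full_text = "".join(span.itertext())

# Roughly what XPath normalize-space() yields: trimmed text with
# every run of whitespace collapsed into a single space.
normalized = " ".join(full_text.split())

print(repr(full_text))
print(normalized)
```

The first value keeps the newlines that surrounded the <br/> elements, the second is a single line, which matches the behaviour seen in Out[4] and In [6] of the session.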
+0

Thanks @paul trmbrth. Great solution! – Brayoni

0

Convert the whole <span> element to a string:

string(
    //*[contains(concat(" ", @class, " "), concat(" ", "review-text", " "))]
)

Note that this only works for the first <span> element matching the criteria. In XPath 2.0, you can use string-join(), which works with an arbitrary number of <span> elements:

string-join(
    //*[contains(concat(" ", @class, " "), concat(" ", "review-text", " "))]/text(),
    ""
)
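
When only XPath 1.0 is available, one possible workaround is to select the matching elements first and build one string per element in the host language. The sketch below illustrates that idea with Python's standard library only; the sample page, the has_class helper, and the variable names are hypothetical, and ElementTree is used here as a stand-in that requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two review spans and one unrelated span.
page = ET.fromstring(
    '<div>'
    '<span class="a-size-base review-text">Good picture.<br/>Easy setup.</span>'
    '<span class="other">ignored</span>'
    '<span class="review-text a-size-base">Slow menus.<br/>Fine remote.</span>'
    '</div>'
)


def has_class(element, name):
    """Whole-token match on the class attribute, which is what the
    concat()/contains() idiom in the XPath expressions achieves."""
    return name in element.get("class", "").split()


# One joined string per matching <span>, instead of one string overall.
reviews = [
    "".join(span.itertext())
    for span in page.iter("span")
    if has_class(span, "review-text")
]

print(reviews)
```

The result is a list with one entry per review, so several reviews on the same page stay separate while each review remains a single block of text.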
+0

I am using **lxml**, which only supports _XPath 1.0_, so I can't use `string-join`. If I convert the whole element with `string`, the XPath query seems to return a single string instead of a list. – Brayoni

+0

The following returns a list in the scrapy shell: `response.xpath('//*[contains(concat(" ", @class, " "), concat(" ", "review-text", " "))]').extract()`. – Brayoni

0

I had to post-process the result to remove the HTML tags using Python regular expressions.

re.sub(r'<span class="a-size-base review-text">|<br>|</span>', "", text) 

I tried @har07's suggestions:

  • Scrapy uses lxml, which only supports XPath 1.0, so I could not use string-join, which is only available in XPath 2.0.
  • I could not get a list of selectors from my XPath query when I tried string().
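
The regular expression above only removes the exact tags it lists, so a <br/> or a span with different attributes would be left in the text. As an alternative sketch (a different technique, not the one used above), the standard library's html.parser can collect the character data and skip every tag; the TextCollector class, the strip_tags helper, and the sample string are names invented for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects character data and ignores every tag it encounters."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves are skipped.
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of an HTML fragment as a single string."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


sample = (
    '<span class="a-size-base review-text">'
    'Line one.<br>Line two.<br/>Line three.</span>'
)
print(strip_tags(sample))
```

This handles both <br> and <br/> without listing them, at the cost of dropping all markup rather than a chosen subset.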