2014-01-11 3 views
0

Extract all links from a web page

I need to extract all links from a web page. My current solution only extracts links from <a> tags:

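require 'open-uri'
require 'nokogiri'

# Return the absolute http(s) href values of every <a> tag on the page.
# URI.open needs Ruby 2.5+; older Rubies used Kernel#open from open-uri.
def get_links(url)
  Nokogiri::HTML(URI.open(url).read).css("a").map do |link|
    href = link.attr("href")
    href if href && href.match(/^https?:/)
  end.compact
end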

This solution is copy-pasted from one of the answers to this question.

The problem is that links in HTML documents do not necessarily appear as the href attribute of an <a> tag. I need to extract all absolute and relative http/https links from HTML/CSS files. Is there a solid, established solution for this?

+0

Give us the HTML source and some of the output you want (just to give us a hint).

+0

Literally any web page. Maybe I'm asking too much, but I need a good starting point.

Answers

3

You can do this using Ruby's built-in URI class. Look at the extract method.

It's not as smart as what you could write using Nokogiri and looking in anchors, images, scripts, onclick handlers, etc., but it's a good and fast starting point.

For instance, looking at the content of the page for this question:

require 'open-uri'
require 'uri'

# URI.open needs Ruby 2.5+; older Rubies used Kernel#open from open-uri.
URI.extract(URI.open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read).grep(/^https?:/)
# => ["http://cdn.sstatic.net/stackoverflow/img/[email protected]?v=fde65a5a78c6", 
#  "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page", 
#  "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page", 
#  "https://stackauth.com", 
#  "http://chat.stackoverflow.com", 
#  "http://blog.stackexchange.com", 
#  "http://schema.org/Article", 
#  "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby", 
#  "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1", 
#  "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract", 
#  "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG", 
#  "http://stackexchange.com/legal/privacy-policy'", 
#  "http://stackexchange.com/legal/terms-of-service'", 
#  "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000", 
#  "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not", 
#  "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved", 
#  "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really", 
#  "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator", 
#  "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment", 
#  "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success", 
#  "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity", 
#  "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code", 
#  "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published", 
#  "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree", 
#  "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky", 
#  "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it", 
#  "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list", 
#  "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire", 
#  "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing", 
#  "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs", 
#  "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen", 
#  "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems", 
#  "http://codegolf.stackexchange.com/questions/18028/largest-number-printable", 
#  "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd", 
#  "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages", 
#  "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again", 
#  "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial", 
#  "http://blog.stackexchange.com?blb=1", 
#  "http://chat.stackoverflow.com", 
#  "http://data.stackexchange.com", 
#  "http://stackexchange.com/legal", 
#  "http://stackexchange.com/legal/privacy-policy", 
#  "http://stackexchange.com/about/hiring", 
#  "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8", 
#  "http://meta.stackoverflow.com", 
#  "http://stackoverflow.com", 
#  "http://serverfault.com", 
#  "http://superuser.com", 
#  "http://webapps.stackexchange.com", 
#  "http://askubuntu.com", 
#  "http://webmasters.stackexchange.com", 
#  "http://gamedev.stackexchange.com", 
#  "http://tex.stackexchange.com", 
#  "http://programmers.stackexchange.com", 
#  "http://unix.stackexchange.com", 
#  "http://apple.stackexchange.com", 
#  "http://wordpress.stackexchange.com", 
#  "http://gis.stackexchange.com", 
#  "http://electronics.stackexchange.com", 
#  "http://android.stackexchange.com", 
#  "http://security.stackexchange.com", 
#  "http://dba.stackexchange.com", 
#  "http://drupal.stackexchange.com", 
#  "http://sharepoint.stackexchange.com", 
#  "http://ux.stackexchange.com", 
#  "http://mathematica.stackexchange.com", 
#  "http://stackexchange.com/sites#technology", 
#  "http://photo.stackexchange.com", 
#  "http://scifi.stackexchange.com", 
#  "http://cooking.stackexchange.com", 
#  "http://diy.stackexchange.com", 
#  "http://stackexchange.com/sites#lifearts", 
#  "http://english.stackexchange.com", 
#  "http://skeptics.stackexchange.com", 
#  "http://judaism.stackexchange.com", 
#  "http://travel.stackexchange.com", 
#  "http://christianity.stackexchange.com", 
#  "http://gaming.stackexchange.com", 
#  "http://bicycles.stackexchange.com", 
#  "http://rpg.stackexchange.com", 
#  "http://stackexchange.com/sites#culturerecreation", 
#  "http://math.stackexchange.com", 
#  "http://stats.stackexchange.com", 
#  "http://cstheory.stackexchange.com", 
#  "http://physics.stackexchange.com", 
#  "http://mathoverflow.net", 
#  "http://stackexchange.com/sites#science", 
#  "http://stackapps.com", 
#  "http://meta.stackoverflow.com", 
#  "http://area51.stackexchange.com", 
#  "http://careers.stackoverflow.com", 
#  "http://creativecommons.org/licenses/by-sa/3.0/", 
#  "http://blog.stackoverflow.com/2009/06/attribution-required/", 
#  "http://creativecommons.org/licenses/by-sa/3.0/", 
#  "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif", 
#  "https:", 
#  "https:'==document.location.protocol,", 
#  "https://ssl", 
#  "http://www", 
#  "https://secure", 
#  "http://edge", 
#  "https:", 
#  "https://sb", 
#  "http://b"] 

URI.extract returns many other entries as well, but grep filters them down using the simple /^https?:/ pattern.
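As a small offline illustration of the same idea, the sketch below runs URI.extract on an inline string instead of a downloaded page. The sample text and the helper name extract_http_urls are made up for this example. Passing a list of schemes as the second argument is an alternative to filtering with grep. Newer Ruby versions mark URI.extract as obsolete, although it is still available.

```ruby
require 'uri'

# Hypothetical helper: return only the http/https URLs found in free text.
def extract_http_urls(text)
  # The second argument restricts matches to the given schemes.
  URI.extract(text, %w[http https])
end

sample = 'Visit http://example.com/a and <img src="https://example.org/b.png"> or ftp://example.net/c'
puts extract_http_urls(sample)
```

Because the match is purely textual, characters that are legal in a URI (a trailing quote or parenthesis, for example) can end up glued to the result, as some entries in the output above show.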

A simple starting point with Nokogiri is:

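require 'open-uri'
require 'nokogiri'

# URI.open needs Ruby 2.5+; older Rubies used Kernel#open from open-uri.
doc = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read)
# Pick the attribute that holds the URL for each kind of tag.
urls = doc.search('a, img').map { |tag|
  case tag.name.downcase
  when 'a'
    tag['href']
  when 'img'
    tag['src']
  end
}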

urls 
# => ["//stackexchange.com/sites", 
#  "http://chat.stackoverflow.com", 
#  "http://blog.stackexchange.com", 
#  "//stackoverflow.com", 
#  "//meta.stackoverflow.com", 
#  "//careers.stackoverflow.com", 
#  "//stackexchange.com", 
#  "https://stackoverflow.com/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456", 
#  "https://stackoverflow.com/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456", 
#  "/tour", 
#  "/help", 
#  "//careers.stackoverflow.com", 
#  "/", 
#  "/questions", 
#  "/tags", 
#  "/about", 
#  "/users", 
#  "https://stackoverflow.com/questions/ask", 
#  "/about", 
#  nil, 
#  "https://stackoverflow.com/questions/21069348/extract-all-links-from-web-page", 
#  nil, 
#  nil, 
#  "#", 
#  "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby", 
#  "https://stackoverflow.com/questions/tagged/html", 
#  "https://stackoverflow.com/questions/tagged/ruby-on-rails", 
#  "https://stackoverflow.com/questions/tagged/ruby", 
#  "https://stackoverflow.com/questions/tagged/regex", 
#  "https://stackoverflow.com/questions/tagged/hyperlink", 
#  "https://stackoverflow.com/q/21069348", 
#  "/posts/21069348/edit", 
#  "https://stackoverflow.com/users/2886945/ivan-denisov", 
#  "https://stackoverflow.com/users/2886945/ivan-denisov", 
#  "https://stackoverflow.com/users/2767755/arup-rakshit", 
#  "https://stackoverflow.com/users/2886945/ivan-denisov", 
#  nil, 
#  nil, 
#  "https://stackoverflow.com/questions/21069348/extract-all-links-from-web-page?answertab=active#tab-top", 
#  "https://stackoverflow.com/questions/21069348/extract-all-links-from-web-page?answertab=oldest#tab-top", 
#  "https://stackoverflow.com/questions/21069348/extract-all-links-from-web-page?answertab=votes#tab-top", 
#  nil, 
#  nil, 
#  nil, 
#  "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract", 
#  "https://stackoverflow.com/a/21069456", 
#  "/posts/21069456/revisions", 
#  "https://stackoverflow.com/users/128421/the-tin-man", 
#  "https://stackoverflow.com/users/128421/the-tin-man", 
#  nil, 
#  nil, 
#  nil, 
#  nil, 
#  "http://regex101.com/r/hN4dI0", 
#  "https://stackoverflow.com/a/21069536", 
#  "https://stackoverflow.com/users/1214800/r3mus", 
#  "https://stackoverflow.com/users/1214800/r3mus", 
#  nil, 
#  nil, 
#  "https://stackoverflow.com/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%23new-answer", 
#  "#", 
#  "http://stackexchange.com/legal/privacy-policy", 
#  "http://stackexchange.com/legal/terms-of-service", 
#  "https://stackoverflow.com/questions/tagged/html", 
#  "https://stackoverflow.com/questions/tagged/ruby-on-rails", 
#  "https://stackoverflow.com/questions/tagged/ruby", 
#  "https://stackoverflow.com/questions/tagged/regex", 
#  "https://stackoverflow.com/questions/tagged/hyperlink", 
#  "https://stackoverflow.com/questions/ask", 
#  "https://stackoverflow.com/questions/tagged/html", 
#  "https://stackoverflow.com/questions/tagged/ruby-on-rails", 
#  "https://stackoverflow.com/questions/tagged/ruby", 
#  "https://stackoverflow.com/questions/tagged/regex", 
#  "https://stackoverflow.com/questions/tagged/hyperlink", 
#  "?lastactivity", 
#  "https://stackoverflow.com/q/21052437", 
#  "https://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs", 
#  "https://stackoverflow.com/q/6700367", 
#  "https://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby", 
#  "https://stackoverflow.com/q/430966", 
#  "https://stackoverflow.com/questions/430966/regex-for-links-in-html-text", 
#  "https://stackoverflow.com/q/3703712", 
#  "https://stackoverflow.com/questions/3703712/extract-all-links-from-a-html-page-exclude-links-from-a-specific-table", 
#  "https://stackoverflow.com/q/5120171", 
#  "https://stackoverflow.com/questions/5120171/extract-links-from-a-web-page", 
#  "https://stackoverflow.com/q/6816138", 
#  "https://stackoverflow.com/questions/6816138/extract-absolute-links-from-a-page-uisng-htmlparser", 
#  "https://stackoverflow.com/q/10177910", 
#  "https://stackoverflow.com/questions/10177910/php-regular-expression-extracting-html-links", 
#  "https://stackoverflow.com/q/10217857", 
#  "https://stackoverflow.com/questions/10217857/extracting-background-images-from-a-web-page-parsing-htmlcss", 
#  "https://stackoverflow.com/q/11300496", 
#  "https://stackoverflow.com/questions/11300496/how-to-extract-a-link-from-head-tag-of-a-remote-page-using-curl", 
#  "https://stackoverflow.com/q/11307491", 
#  "https://stackoverflow.com/questions/11307491/how-to-extract-all-links-on-a-page-using-crawler4j", 
#  "https://stackoverflow.com/q/17712493", 
#  "https://stackoverflow.com/questions/17712493/extract-links-from-bbcode-with-ruby", 
#  "https://stackoverflow.com/q/20290869", 
#  "https://stackoverflow.com/questions/20290869/strip-away-html-tags-from-extracted-links", 
#  "//stackexchange.com/questions?tab=hot", 
#  "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000", 
#  "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not", 
#  "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved", 
#  "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really", 
#  "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator", 
#  "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment", 
#  "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success", 
#  "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity", 
#  "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code", 
#  "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published", 
#  "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree", 
#  "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky", 
#  "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it", 
#  "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list", 
#  "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire", 
#  "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing", 
#  "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs", 
#  "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen", 
#  "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems", 
#  "http://codegolf.stackexchange.com/questions/18028/largest-number-printable", 
#  "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd", 
#  "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages", 
#  "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again", 
#  "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial", 
#  "#", 
#  "/feeds/question/21069348", 
#  "/about", 
#  "/help", 
#  "/help/badges", 
#  "http://blog.stackexchange.com?blb=1", 
#  "http://chat.stackoverflow.com", 
#  "http://data.stackexchange.com", 
#  "http://stackexchange.com/legal", 
#  "http://stackexchange.com/legal/privacy-policy", 
#  "http://stackexchange.com/about/hiring", 
#  "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8", 
#  nil, 
#  "/contact", 
#  "http://meta.stackoverflow.com", 
#  "http://stackoverflow.com", 
#  "http://serverfault.com", 
#  "http://superuser.com", 
#  "http://webapps.stackexchange.com", 
#  "http://askubuntu.com", 
#  "http://webmasters.stackexchange.com", 
#  "http://gamedev.stackexchange.com", 
#  "http://tex.stackexchange.com", 
#  "http://programmers.stackexchange.com", 
#  "http://unix.stackexchange.com", 
#  "http://apple.stackexchange.com", 
#  "http://wordpress.stackexchange.com", 
#  "http://gis.stackexchange.com", 
#  "http://electronics.stackexchange.com", 
#  "http://android.stackexchange.com", 
#  "http://security.stackexchange.com", 
#  "http://dba.stackexchange.com", 
#  "http://drupal.stackexchange.com", 
#  "http://sharepoint.stackexchange.com", 
#  "http://ux.stackexchange.com", 
#  "http://mathematica.stackexchange.com", 
#  "http://stackexchange.com/sites#technology", 
#  "http://photo.stackexchange.com", 
#  "http://scifi.stackexchange.com", 
#  "http://cooking.stackexchange.com", 
#  "http://diy.stackexchange.com", 
#  "http://stackexchange.com/sites#lifearts", 
#  "http://english.stackexchange.com", 
#  "http://skeptics.stackexchange.com", 
#  "http://judaism.stackexchange.com", 
#  "http://travel.stackexchange.com", 
#  "http://christianity.stackexchange.com", 
#  "http://gaming.stackexchange.com", 
#  "http://bicycles.stackexchange.com", 
#  "http://rpg.stackexchange.com", 
#  "http://stackexchange.com/sites#culturerecreation", 
#  "http://math.stackexchange.com", 
#  "http://stats.stackexchange.com", 
#  "http://cstheory.stackexchange.com", 
#  "http://physics.stackexchange.com", 
#  "http://mathoverflow.net", 
#  "http://stackexchange.com/sites#science", 
#  "http://stackapps.com", 
#  "http://meta.stackoverflow.com", 
#  "http://area51.stackexchange.com", 
#  "http://careers.stackoverflow.com", 
#  "http://creativecommons.org/licenses/by-sa/3.0/", 
#  "http://blog.stackoverflow.com/2009/06/attribution-required/", 
#  "http://creativecommons.org/licenses/by-sa/3.0/", 
#  "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1", 
#  "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG", 
#  "http://i.stack.imgur.com/fmgha.jpg?s=32&g=1", 
#  "/posts/21069348/ivc/8228", 
#  "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif"] 

This uses a case statement to apply a bit of "smarts" to know which attribute should be extracted from a particular type of tag. More work needs to be done, since an anchor could use onclick, and other tags could be used for JavaScript events.

+0

Yes. Nokogiri is more flexible in the end. I'll be able to convert relative links to absolute ones, etc.

+0

Use URI or Addressable::URI to manipulate URIs. They're designed for this.
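To illustrate that advice, here is a minimal sketch of resolving the kinds of relative links shown in the Nokogiri output against the page URL, using the standard library's URI.join. The absolutize helper name and the base variable are assumptions made for this example.

```ruby
require 'uri'

# Hypothetical helper: turn an href/src value into an absolute URL string.
# Returns nil for values that cannot be parsed or joined as URIs.
def absolutize(base_url, link)
  URI.join(base_url, link).to_s
rescue URI::Error
  nil
end

base = 'http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page'

['/tour', '//stackexchange.com/sites', '?lastactivity', 'https://stackoverflow.com/a/21069456'].each do |link|
  puts "#{link} => #{absolutize(base, link)}"
end
```

Links that are already absolute pass through unchanged, while path-only and protocol-relative links pick up the missing parts from the base URL.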

0

I agree that the Tin Man's answer is undoubtedly the best route. If you do need a catch-all regex that will capture all URLs (as close to exact as possible), this should work:

\w+:\/\/[\w.-]+(?::?\d{1,5})?[-\w.\/?=&%]* 

A few examples: http://regex101.com/r/hN4dI0

Note that this requires a protocol prefix (http://, mailto://), so it won't match a bare www.google.com.
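As a quick illustration of how the pattern behaves, the sketch below applies it with String#scan to a made-up sample string. The constant name URL_PATTERN and the sample text are assumptions for this example.

```ruby
# The pattern from this answer, written as a Ruby regexp literal.
URL_PATTERN = /\w+:\/\/[\w.-]+(?::?\d{1,5})?[-\w.\/?=&%]*/

sample = 'See http://example.com:8080/path?x=1&y=2 and ftp://files.example.org/a.txt but not www.google.com'

# scan returns every non-overlapping match as a string
# (the pattern has no capture groups).
matches = sample.scan(URL_PATTERN)
puts matches
```

The bare www.google.com is skipped because nothing in it matches the "://" part of the pattern.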

+0

That's what URI does internally.

+0

@theTinMan yes ;) I mostly put this here for people who find this question and aren't using Rails. – brandonscript
