How to extract some data from an HTML file in R
I have an HTML file from which I would like to extract some data and build a vector from the result.
My HTML file looks like this:
data
"<HTML>\r\n<HEAD>\r\n<meta http-equiv=\"Expires\" content=\"0\"/>\n<meta http-equiv=\"Pragma\" content=\"no-cache\"/>\n\r\n<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html;CHARSET=Cp1252\"/>\r\n\r\n<TITLE>file</TITLE>\r\n<LINK REL=\"stylesheet\" TYPE=\"text/css\" HREF=\"/SiteScope/htdocs/artwork/sitescopeUI.css\"/>\r\n</HEAD>\n\r\n<BODY BGCOLOR=\"#ffffff\" LINK=#1155bb ALINK=#1155bb VLINK=#1155bb>\n\r\n<H2></H2><p><p>\r\n<A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/latest.html><B>Most Recent Report</B></A>\r\n<P><CENTER>\n<A NAME=uptimeSummary> </A>\n<TABLE WIDTH=\"100%\" BORDER=1 CELLSPACING=0>\n <CAPTION><B>Report Summary</B></CAPTION>\r\n <TR BGCOLOR=\"#88AA99\"><TH> </TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag1</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag10</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag11</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag12</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag13</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag14</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag15</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag16</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag2</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on server1</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag4</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on server2</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag6</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on server3</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag8</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on server9</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag17</TH><TH 
COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on server10</TH></TR>\r\n <TR BGCOLOR=\"#DDDDDD\"><TD><B>Information For</B></TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD></TR>\r\n <TR BGCOLOR=\"#DDDDDD\"><TD><A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-15_33-01_25_2015.html>3:33 PM 1/18/15 - 3:33 PM 1/25/15</A> (<A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-15_33-01_25_2015.txt>text</A>)</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.67%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">28%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2.85%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">10%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.65%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">18%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.54%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">14%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.12%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">15%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.42%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">18%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.72%</TD><TD ALIGN=RIGHT 
BGCOLOR=\"#FFFFFF\">6%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.26%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">30%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.42%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">16%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.4%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">16%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.58%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.46%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.4%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.49%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">11%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2.25%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">8%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.49%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4%</TD></TR>\r\n <TR BGCOLOR=\"#DDDDDD\"><TD><A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-01_05-01_25_2015.html>1:05 AM 1/18/15 - 1:05 AM 1/25/15</A> (<A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-01_05-01_25_2015.txt>text</A>)</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.68%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">28%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2.75%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">10%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.6%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">18%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.41%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">14%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.11%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">15%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.39%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">18%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.72%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">6%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.25%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">30%</TD><TD 
ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.43%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">16%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.39%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">16%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.58%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.46%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.4%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.49%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">11%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2.17%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">8%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.55%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4%</TD></TR>\r\n <TR BGCOLOR=\"#DDDDDD\"><TD><A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-11_26-01_20_2015.html>11:26 AM 1/13/15 - 11:26 AM 1/20/15</A> (<A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-11_26-01_20_2015.txt>text</A>)</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.83%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">27%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2.74%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">15%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.51%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">6%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.64%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">20%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.32%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">21%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.84%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">20%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.72%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.39%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">27%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.49%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">16%</TD><TD ALIGN=RIGHT 
BGCOLOR=\"#FFFFFF\">3.45%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">16%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.65%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.51%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">6%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.42%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">6%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.55%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2.11%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">8%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.61%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4%</TD></TR>\r\n</TABLE></CENTER>\r\n<P><FORM ACTION=\"/SiteScope/cgi/go.exe/SiteScope\" method=\"POST\">\n<input type=\"hidden\" name=\"page\" value=\"adhocReport\"/>\n<input type=\"hidden\" name=\"queryID\" value=\"1725002550\"/>\n<input type=\"hidden\" name=\"htmlFile\" value=\"yes\"/>\n<input type=\"hidden\" name=\"account\" value=\"login59\"/>\n<input type=\"hidden\" name=\"isFlipperContext\" value=\"false\"/>\n<input type=\"hidden\" name=\"isSwingContext\" value=\"true\"/>\n<input type=\"hidden\" name=\"locale\" value=\"en_US\"/>\n<input type=\"hidden\" name=\"useOldLinks\" value=\"false\"/>\n<input class=\"button\" type=\"submit\" value=\"Generate\" onclick=\"this.disabled=true; this.value= 'Generating. Wait..'; document.forms[0].submit();\" />\n</FORM>\nManagement Report Now - this will immediately generate and save this report, using the most current data\n (<B>Note: </B>This may take a few moments, depending on the speed of the SiteScope machine, the number of monitors and the time period of the report)\n</BODY></HTML>\r\n"
I need to grep the pieces that start with HREF and end with >. For example,
I need to put all of these entries into a vector:
HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-15_33-01_25_2015.txt
HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-11_26-01_20_2015.txt
HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-11_26-01_20_2015.txt
and then sort the vector and list the latest entry in it.
I tried this:
vec<-as.vector()
vec<-append(grepl("(HREF.*>?"),data, value=TRUE)
No luck. I would appreciate any guidance with this.
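For what it's worth, here is a minimal base-R sketch of one way this could be done; it is not the only approach. Since `data` is a single long string, `grepl()` would only report TRUE or FALSE for the whole string, so the sketch pulls out the matching substrings with `regmatches()` and `gregexpr()` instead. The `html` string is a shortened, hypothetical stand-in for `data`. Keeping only the `.txt` links, and reading the `Report-HH_MM-mm_dd_yyyy` part of the file name as a timestamp, are assumptions based on the examples in the question.

```r
# Shortened, hypothetical stand-in for the `data` string from the question.
html <- paste0(
  "<LINK REL=\"stylesheet\" HREF=\"/SiteScope/htdocs/artwork/sitescopeUI.css\"/>\r\n",
  "<TD><A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-15_33-01_25_2015.html>a</A>",
  " (<A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-15_33-01_25_2015.txt>text</A>)</TD>\r\n",
  "<TD><A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-01_05-01_25_2015.html>b</A>",
  " (<A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-01_05-01_25_2015.txt>text</A>)</TD>\r\n",
  "<TD><A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-11_26-01_20_2015.html>c</A>",
  " (<A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-11_26-01_20_2015.txt>text</A>)</TD>\r\n"
)

# Extract every substring that starts with "HREF=" and runs up to,
# but not including, the next ">".
hits <- regmatches(html, gregexpr("HREF=[^>]+", html))[[1]]

# Assumption: only the .txt report links are wanted; drop duplicates.
txt_links <- unique(grep("\\.txt$", hits, value = TRUE))

# Assumption: the file name encodes hour_minute-month_day_year.
stamp <- sub(".*Report-(.*)\\.txt$", "\\1", txt_links)
when  <- as.POSIXct(strptime(stamp, "%H_%M-%m_%d_%Y", tz = "UTC"))

# Sort chronologically and take the most recent entry.
txt_links <- txt_links[order(when)]
latest    <- tail(txt_links, 1)
print(txt_links)
print(latest)
```

A plain `sort(txt_links)` would order the names by time of day first, which is why the sketch parses the timestamp before ordering.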
You could try `library(XML); xpathSApply(doc, "//*[@HREF]")`, but I can't test it because the HTML is malformed.
@RichardScriven, I am pulling the data into R, and R puts the \r and \n characters into the file. Regardless, I need to extract the data that starts with HREF and ends with > into a vector – user1471980
@RichardScriven, I get this error: Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "character" – user1471980
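The error above suggests that `xpathApply` was given the raw character string rather than a parsed document. A minimal sketch of the parsing step follows, assuming the XML package is installed. The `html` string is a shortened, hypothetical stand-in for `data`. As far as I know, the HTML parser lower-cases attribute names, so the query uses `@href` rather than `@HREF`.

```r
library(XML)  # assumption: the XML package is installed

# Shortened, hypothetical stand-in for the `data` string from the question.
html <- paste0(
  "<HTML><BODY><TABLE><TR><TD>",
  "<A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-15_33-01_25_2015.html>a</A>",
  " (<A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-15_33-01_25_2015.txt>text</A>)",
  "</TD></TR></TABLE></BODY></HTML>"
)

# Parse the string first; asText = TRUE says the argument is HTML content,
# not a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# Collect the href attribute of every <a> element that has one.
links <- xpathSApply(doc, "//a[@href]", xmlGetAttr, "href")

# Assumption: only the .txt report links are wanted.
txt_links <- as.character(links[grepl("\\.txt$", links)])
print(txt_links)
```

Note that this returns the attribute values without the leading `HREF=`; it can be added back with `paste0("HREF=", txt_links)` if that prefix is needed.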