2015-06-02 2 views

Using XPath to extract node attributes and attributes with a colon in their names

Using xpathSApply in R, I am trying to get the URL stored in the edgar:url attribute:

<edgar:xbrlFile edgar:sequence="3" edgar:file="edgr-2004_10k.xml" edgar:type="EX-100.INS" edgar:size="25257" edgar:description="" edgar:url="http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/edgr-2004_10k.xml" /> 

I have tried several variations of the following:

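library(RCurl)  # provides getURL()
library(XML)    # provides xmlParse(), xpathSApply(), xmlValue()

url <- "http://www.sec.gov/Archives/edgar/monthly/xbrlrss-2005-04.xml"
data <- getURL(url)
doc <- xmlParse(data)
# xmlValue returns the text content of the matched nodes, not their attributes
urls <- xpathSApply(doc, "//item/*[name()='edgar:xbrlFiling']", xmlValue)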

Below is a sample item from the feed at the URL used in the code above:

<item> 
    <title>EDGAR ONLINE INC (0001080224) (Filer)</title> 
    <link>http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/0001275287-05-001434-index.htm</link>
    <description>8-K</description> 
    <pubDate>Mon, 25 Apr 2005 15:15:09 EDT</pubDate> 
    <edgar:xbrlFiling xmlns:edgar="http://www.sec.gov/Archives/edgar"> 
    <edgar:companyName>EDGAR ONLINE INC</edgar:companyName> 
    <edgar:formType>8-K</edgar:formType> 
    <edgar:filingDate>04/25/2005</edgar:filingDate> 
    <edgar:cikNumber>0001080224</edgar:cikNumber> 
    <edgar:accessionNumber>0001275287-05-001434</edgar:accessionNumber> 
    <edgar:fileNumber>001-32194</edgar:fileNumber> 
    <edgar:acceptanceDatetime>20050425151509</edgar:acceptanceDatetime> 
    <edgar:period>20050425</edgar:period> 
    <edgar:assistantDirector>2 &amp; 3</edgar:assistantDirector> 
    <edgar:assignedSic>7389</edgar:assignedSic> 
    <edgar:fiscalYearEnd>1204</edgar:fiscalYearEnd> 
    <edgar:xbrlFiles> 
     <edgar:xbrlFile edgar:sequence="1" edgar:file="eo2425.txt" edgar:type="8-K" edgar:size="5282" edgar:description="" edgar:url="http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/eo2425.txt" /> 
     <edgar:xbrlFile edgar:sequence="2" edgar:file="eo2425ex991.txt" edgar:type="EX-99.1" edgar:size="4469" edgar:description="" edgar:url="http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/eo2425ex991.txt" /> 
     <edgar:xbrlFile edgar:sequence="3" edgar:file="edgr-2004_10k.xml" edgar:type="EX-100.INS" edgar:size="25257" edgar:description="" edgar:url="http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/edgr-2004_10k.xml" /> 
     <edgar:xbrlFile edgar:sequence="4" edgar:file="edgr-20050228.xsd" edgar:type="EX-100.SCH" edgar:size="12111" edgar:description="" edgar:url="http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/edgr-20050228.xsd" /> 
     <edgar:xbrlFile edgar:sequence="5" edgar:file="edgr-20050228_cal.xml" edgar:type="EX-100.CAL" edgar:size="18069" edgar:description="" edgar:url="http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/edgr-20050228_cal.xml" /> 
     <edgar:xbrlFile edgar:sequence="6" edgar:file="edgr-20050228_lab.xml" edgar:type="EX-100.LAB" edgar:size="51434" edgar:description="" edgar:url="http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/edgr-20050228_lab.xml" /> 
     <edgar:xbrlFile edgar:sequence="7" edgar:file="edgr-20050228_pre.xml" edgar:type="EX-100.PRE" edgar:size="27275" edgar:description="" edgar:url="http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/edgr-20050228_pre.xml" /> 
    </edgar:xbrlFiles> 
    </edgar:xbrlFiling> 
</item> 
<item> 
Possible duplicate: http://stackoverflow.com/a/25316044/423105 – LarsH

Not a duplicate. This question is about extracting a node attribute AND that attribute has a colon in its name. – Optimus

Answer


This is pretty straightforward with the XML package, and also with xml2 (which, at the time of writing, can only be installed from GitHub).

XML:

xpathSApply(doc, "//edgar:xbrlFile", xmlGetAttr, "edgar:url", namespaces="edgar") 
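
The call above passes only the prefix and lets the XML package look up the matching namespace declaration in the document. As a minimal offline sketch of the same idea, the example below parses a trimmed-down stand-in for one feed item (the `feed` string, the enclosing `rss`/`channel` elements and the variable names are assumptions for illustration, not the real feed) and maps the `edgar` prefix to its namespace URI explicitly. It also shows an attribute predicate that narrows the match to a single file type.

```r
library(XML)

# Trimmed-down stand-in for one feed item (an assumption, not the full feed).
feed <- '<rss><channel><item>
  <edgar:xbrlFiling xmlns:edgar="http://www.sec.gov/Archives/edgar">
    <edgar:xbrlFiles>
      <edgar:xbrlFile edgar:sequence="1" edgar:type="8-K" edgar:url="http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/eo2425.txt" />
      <edgar:xbrlFile edgar:sequence="3" edgar:type="EX-100.INS" edgar:url="http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/edgr-2004_10k.xml" />
    </edgar:xbrlFiles>
  </edgar:xbrlFiling>
</item></channel></rss>'

doc <- xmlParse(feed, asText = TRUE)

# Map the prefix used in the XPath expression to the namespace URI explicitly.
ns <- c(edgar = "http://www.sec.gov/Archives/edgar")

# All edgar:url attributes, one per edgar:xbrlFile node.
urls <- xpathSApply(doc, "//edgar:xbrlFile", xmlGetAttr, "edgar:url",
                    namespaces = ns)

# Only the file whose edgar:type attribute is "EX-100.INS".
ins_url <- xpathSApply(doc, "//edgar:xbrlFile[@edgar:type='EX-100.INS']",
                       xmlGetAttr, "edgar:url", namespaces = ns)
```

Spelling out the URI keeps the query independent of where in the document the `xmlns:edgar` declaration appears.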

xml2:

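library(xml2)
library(magrittr)  # provides %>%

# url must hold the feed address from the question
dat <- read_xml(url)
ns <- xml_ns(dat)  # namespaces found in the document, including edgar

dat %>%
    xml_find_all("//edgar:xbrlFile", ns=ns) %>%
    xml_attr("edgar:url", ns=ns)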
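
For trying the xml2 approach without network access, here is a small sketch that reads a literal XML string instead of the feed URL. The `feed` string is a shortened stand-in for one item and the variable names are made up for illustration; the calls are the same `xml_ns()`, `xml_find_all()` and `xml_attr()` used above, written without the pipe.

```r
library(xml2)

# Shortened stand-in for one feed item (an assumption, not the full feed).
feed <- '<rss><channel><item>
  <edgar:xbrlFiling xmlns:edgar="http://www.sec.gov/Archives/edgar">
    <edgar:xbrlFiles>
      <edgar:xbrlFile edgar:sequence="1" edgar:type="8-K" edgar:url="http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/eo2425.txt" />
      <edgar:xbrlFile edgar:sequence="3" edgar:type="EX-100.INS" edgar:url="http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/edgr-2004_10k.xml" />
    </edgar:xbrlFiles>
  </edgar:xbrlFiling>
</item></channel></rss>'

# read_xml() accepts literal XML as well as a path or URL.
dat <- read_xml(feed)

# Collect the prefix-to-URI mappings found in the document.
ns <- xml_ns(dat)

# Find the file nodes, then read the namespaced attribute from each one.
files <- xml_find_all(dat, "//edgar:xbrlFile", ns = ns)
urls <- xml_attr(files, "edgar:url", ns = ns)
```

Passing `ns` to `xml_attr()` is what lets the qualified name `edgar:url` be resolved against the namespace rather than treated as a plain attribute name.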

Both give the same results:

## [1] "http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/eo2425.txt"    
## [2] "http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/eo2425ex991.txt"  
## [3] "http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/edgr-2004_10k.xml"  
## [4] "http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/edgr-20050228.xsd"  
## [5] "http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/edgr-20050228_cal.xml" 
## [6] "http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/edgr-20050228_lab.xml" 
## [7] "http://www.sec.gov/Archives/edgar/data/1080224/000127528705001434/edgr-20050228_pre.xml" 
## [8] "http://www.sec.gov/Archives/edgar/data/29669/000119312505068717/d8k.htm"     
## [9] "http://www.sec.gov/Archives/edgar/data/29669/000119312505068717/xrrd-20050331.xml"  
## [10] "http://www.sec.gov/Archives/edgar/data/29669/000119312505068717/xrrd-20050331.xsd"  
## [11] "http://www.sec.gov/Archives/edgar/data/29669/000119312505068717/xrrd-20050331_cal.xml" 
## [12] "http://www.sec.gov/Archives/edgar/data/29669/000119312505068717/xrrd-20050331_lab.xml" 
## [13] "http://www.sec.gov/Archives/edgar/data/29669/000119312505068717/xrrd-20050331_pre.xml" 
## [14] "http://www.sec.gov/Archives/edgar/data/13610/000095/bne-20050404_8kfinal.htm" 
## [15] "http://www.sec.gov/Archives/edgar/data/13610/000095/bne-20041231er.xml"  
## [16] "http://www.sec.gov/Archives/edgar/data/13610/000095/bne-20050307er.xsd"  
## [17] "http://www.sec.gov/Archives/edgar/data/13610/000095/bne-20050307er_pre.xml" 
## [18] "http://www.sec.gov/Archives/edgar/data/13610/000095/bne-20050307er_lab.xml" 
## [19] "http://www.sec.gov/Archives/edgar/data/13610/000095/bne-20050307er_cal.xml" 
I knew it had to be simpler than the way I was doing it. Thanks! – Optimus
