2014-10-28 3 views

Parsing a string in pandas where there is no delimiter

I have just parsed a web page using pandas:

import pandas as pd
import requests

# payload is defined elsewhere (not shown in the question)
r = requests.post("https://www.eigroup.co.uk/clients/auctions/fulldetails.aspx?auctionid=17999", params=payload)

parsed_page = pd.read_html(r.text, attrs={"class": "table-search-result"})

(a sample of the HTML being parsed)

<table cellspacing="0" id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1" style="width:100%;border-collapse:collapse;"> 
<tr> 
    <td colspan="2"> 
    <table class="table-search-result"> 
     <tr> 
      <th>66D Charlwood Street, Pimlico, London, SW1V 4PQ</th> 
      <th style="text-align: right; white-space: nowrap;"> 

       <a href="http://www.englishhouseprices.com/results.aspx?postcode=SW1V 4PQ" id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_A2" class="icon" target="_blank"> 
        <img src="/content/images/icons/32/houseprices.png" alt="Compare with Property Prices" title="Compare with Property Prices in this Postcode" /></a> 
       <a id="" title="View Auction Details" class="icon" onclick="return o(this,900,650,1,1)" href="/clients/auctions/details.aspx?auctionid=17999" target="_blank"><img title="View Auction Details" src="/content/images/icons/32/auctiondetails.png" alt="" /></a> 


       <a id="" title="Trend Analysis" class="icon" onclick="return o(this,900,650,1,1)" href="/clients/lots/trend-analysis.aspx?lotid=756425" target="_blank"><img title="Trend Analysis" src="/content/images/icons/32/piechart.png" alt="" /></a> 
       <a href='http://maps.google.co.uk?q=SW1V 4PQ' target="_blank"> 
        <img id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_ImageLocationMap" title="Location Map" class="icon" src="/content/images/icons/32/compass.png" /></a> 
       <a href='http://www.multimap.com/map/photo.cgi?scale=5000&mapsize=big&pc=SW1V 4PQ' target="_blank"> 
        <img id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_ImageAerialPhoto" title="Aerial Photo" class="icon" src="/content/images/icons/32/camera.png" /></a> 
       <a href='/clients/search/search-results.aspx?searchtype=comparable&lotid=756425' title="Find similar properties like this one"> 
        <img src="/content/images/icons/32/find.png" alt="Find other properties matching this tenant" title="Find similar properties like this one" class="icon" /></a> 

       <a href='/clients/search/search-results.aspx?searchtype=history&lotid=756425'> 
        <img src="/content/images/icons/32/history.png" alt="Find history of property in this street" title="Find history of property in this street" class="icon" /></a> 
       <a id="" title="Add to one of my portfolios" class="icon" Title="Add to portfolio" onclick="return o(this,650,500,1,1)" href="/clients/portfolios/lot.aspx?lotid=756425" target="_blank"><img title="Add to one of my portfolios" src="/content/images/icons/32/briefcase.png" alt="" /></a> 
       <a href="https://www.eigroup.co.uk/files/55/17999/6ec339ec-d59e-4b8a-9136-dc6e9a583328.pdf" id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_A4" target="_blank"> 
        <img src="/content/images/icons/32/catalogue.png" alt="Catalogue Entry" class="icon" title="Full Catalogue Entry" /></a> 
       <a id="" title="Add to my shortlist" class="icon" Title="Add to shortlist" onclick="return o(this,900,650,1,1)" href="/clients/lots/shortlist.aspx?lotid=756425" target="shortlist"><img title="Add to my shortlist" src="/content/images/icons/32/shortlist.png" alt="" /></a> 

      </th> 
     </tr> 
     <tr> 
      <td colspan="2" style="background-color: #f5f5f5;"> 
       <table style="width: 100%"> 
        <tr> 
         <td style="background-color: #f1f1f1; width: 170px; text-align: center;"> 
          <a href='/clients/lots/details.aspx?lotid=756425&hb=1' target='756425' onclick="window.open(this.href,this.target,'width=900,height=650,resizable=yes,scrollbars=yes');return false" title="Auction property in Pimlico, London, SW1"> 
           <img id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_Image1" src="https://www.eigroup.co.uk/files/55/17999/de591a4f-7da1-4bcd-a42c-76731bd72a23.jpg" alt="Pimlico, London, SW1" style="border-color:Black;border-width:2px;border-style:Solid;width:150px;" /> 
          </a> 
         </td> 
         <td style="padding-left: 10px; width: 50%;"> 
          <p> 
           <b>Description</b><br /> 
           Leasehold 2nd Floor Studio Flat Unmodernised Vacant 
          </p> 
          <p id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_P1"> 
           <b>Guide Price</b><br /> 
           £450,000 Plus 
          </p> 

          <p> 
           <b>Lot Number</b><br /> 
           2 
          </p> 
          <p> 
           <b> </b> 
          </p> 
         </td> 
         <td style="white-space: nowrap;"> 
          <p> 
           <b>Auctioneer</b><br /> 
           <a id="" onclick="return o(this,900,650,1,1)" href="/clients/auctioneers/details.aspx?auctioneerid=55" target="_blank">Savills (London - National)</a> 

          </p> 
          <p id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_P3"> 
           <b>Vendor</b><br /> 
           Housing Association 
          </p> 

         </td> 
         <td style="white-space: nowrap;"> 
          <p> 
           <b>Auction Date</b><br /> 
           <a id="" onclick="return o(this,900,650,1,1)" href="/clients/auctions/details.aspx?auctionid=17999" target="_blank">28 October 2014</a> 
          </p> 


          <p id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_P7"> 
           <b>Lease Details</b><br /> 
           125 Yr, commencing 01/01/2013 (GR.£250.PA) 
          </p> 
         </td> 
        </tr> 
       </table> 
      </td> 
     </tr> 

    </table> 
</td> 
</tr> 

and I get the following:

In [86]: parsed_page[1][0][1] 
Out[86]: u'Description Leasehold 2nd Floor Studio Flat Unmodernised Vacant Guide Price \xa3450,000 Plus Lot Number 2 Auctioneer Savills (London - National) Vendor Housing Association Auction Date 28 October 2014 Lease Details 125 Yr, commencing 01/01/2013 (GR.\xa3250.PA)' 

The problem is that I want to extract the description, guide price, and so on, but there are no delimiters in the string and the number of characters after each label varies. Is there a keyword argument I am missing when I parse the page?

How can I split these values into new columns?

Show the actual data you are parsing, if you don't want to give away your login. – filmor

Thanks @PadraicCunningham. Presumably I just do it this way for all the tables? Also, how can I add the address, which is delimited by tags?

Yes, you can access the address by its tag.

Answer

Using BeautifulSoup, as I recommended in my answer to your last question, you can split the text and build a dict:

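from bs4 import BeautifulSoup

# html is the page source, e.g. r.text from the request in the question
soup = BeautifulSoup(html, "html.parser")

# Each <p> holds a bold label, a <br />, and then the value on the next line
s = soup.find_all("p")
details = (ele.text.strip().split("\n") for ele in s)

d = {}

for det in details:
    # Keep only paragraphs that split into exactly one label and one value
    if len(det) == 2:
        d[det[0].strip()] = det[1].strip()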

{u'Vendor': u'Housing Association', u'Description': u'Leasehold 2nd Floor Studio Flat Unmodernised Vacant', u'Auction Date': u'28 October 2014', u'Auctioneer': u'Savills (London - National)', u'Lot Number': u'2', u'Guide Price': u'\xc2\u0141450,000 Plus', u'Lease Details': u'125 Yr, commencing 01/01/2013 (GR.\xc2\u0141250.PA)'} 
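
To cover the follow-up from the comments (doing this for every table and adding the address), something along these lines might work. It is only a sketch: `parse_lots` and `SAMPLE_HTML` are hypothetical names, the sample is a trimmed-down stand-in for the real page, and it assumes that the first `<th>` in each `table-search-result` table holds the address and that the first piece of text in each `<p>` is the label.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page source (r.text in the question).
SAMPLE_HTML = """
<table class="table-search-result">
 <tr>
  <th>66D Charlwood Street, Pimlico, London, SW1V 4PQ</th>
  <th><a href="/clients/auctions/details.aspx?auctionid=17999">details</a></th>
 </tr>
 <tr>
  <td colspan="2">
   <table>
    <tr>
     <td>
      <p>
       <b>Description</b><br />
       Leasehold 2nd Floor Studio Flat Unmodernised Vacant
      </p>
      <p>
       <b>Lot Number</b><br />
       2
      </p>
      <p>
       <b> </b>
      </p>
     </td>
    </tr>
   </table>
  </td>
 </tr>
</table>
"""


def parse_lots(html):
    """Return one dict per lot table: the address plus each label/value pair."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_="table-search-result"):
        row = {}
        # Assumption: the first header cell of the table holds the address.
        first_th = table.find("th")
        if first_th is not None:
            row["Address"] = first_th.get_text(strip=True)
        for p in table.find_all("p"):
            # stripped_strings skips whitespace-only text, so empty
            # paragraphs such as <p><b> </b></p> yield nothing here.
            parts = list(p.stripped_strings)
            if len(parts) < 2:
                continue
            # Assumption: first string is the label, the rest is the value.
            row[parts[0]] = " ".join(parts[1:])
        rows.append(row)
    return rows


rows = parse_lots(SAMPLE_HTML)
```

Passing the result to `pd.DataFrame(rows)` should then give one row per lot and one column per label, which is the layout asked for in the question.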