2017-01-19 4 views
3

Python - transposing a Pandas DataFrame

I would like to transpose the following DataFrame so that I can export it to an Oracle table.

0 ID         Available Quota \ 
1 1724  GOM COD GOM HADD GOM BB GREYSOLE DABS GOM YT 
2 1578 GBE COD GBW COD GB BB GB YT SNE BB SNE YT GOM ... 
3 310 GBE COD GBW COD DABS WHAKE POLL RED SNE BB GOM BB 

0         Live Weight Pounds \ 
1      2328 445 3007 850 3101 1995 
2  538 5894 1755 243 490 153 3965 2727 9227 15060 
3 825 9033 1241 3120 65234 76610 1688 1195 2121 ... 

0            Price Date Posted 
1          Package $9,000  5/20 
2 $1.00 $0.40 $0.20 $1.00 $0.45 $0.50 $0.15 $0.2...  5/20 
3         Package $15,000  5/20 

Ideally, the data should be laid out like this, so that I can easily load it into my Oracle database:

enter image description here

and the start of the second ID should look like this:

enter image description here

The original data table looks like this; my goal, by the way, is to parse only the rows with the most recent date in the table:

enter image description here

Using pd.transpose changes nothing, because my DataFrame is apparently (3, 5), and it would need to be (5, 5) for that to work. And using pd.melt() results in:

     0            value 
0     ID            1724 
1     ID            1578 
2     ID            310 
3  Available Quota  GOM COD GOM HADD GOM BB GREYSOLE DABS GOM YT 
4  Available Quota GBE COD GBW COD GB BB GB YT SNE BB SNE YT GOM ... 
5  Available Quota GBE COD GBW COD DABS WHAKE POLL RED SNE BB GOM BB 
6 Live Weight Pounds      2328 445 3007 850 3101 1995 
7 Live Weight Pounds  538 5894 1755 243 490 153 3965 2727 9227 15060 
8 Live Weight Pounds 825 9033 1241 3120 65234 76610 1688 1195 2121 ... 
9    Price          Package $9,000 
10    Price $1.00 $0.40 $0.20 $1.00 $0.45 $0.50 $0.15 $0.2... 
11    Price         Package $15,000 
12   Date Posted            5/20 
13   Date Posted            5/20 
14   Date Posted            5/20 

... which also won't work for the export.
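As a side note on the shapes mentioned above, here is a minimal sketch on a toy frame (values shortened from the question, so treat them as illustrative). Transposing does not require a square frame: a (3, 5) frame simply becomes (5, 3). Neither transpose nor melt splits the multi-value cells, though, which is what the target layout needs.

```python
import pandas as pd

# Toy frame with the same shape as in the question: 3 rows, 5 columns
# (cell values are shortened from the question and only illustrative)
df = pd.DataFrame({
    "ID": ["1724", "1578", "310"],
    "Available Quota": ["GOM COD GOM HADD", "GBE COD GBW COD", "GBE COD DABS"],
    "Live Weight Pounds": ["2328 445", "538 5894", "825 1241"],
    "Price": ["Package $9,000", "$1.00 $0.40", "Package $15,000"],
    "Date Posted": ["5/20", "5/20", "5/20"],
})

transposed = df.T    # swaps rows and columns; works for any shape
melted = df.melt()   # stacks every column into (variable, value) pairs

print(df.shape, transposed.shape, melted.shape)
```

In both results each cell still holds the whole space-separated string, so the cells have to be split before any reshaping helps.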

My relevant code:

from io import StringIO

import pandas as pd

def read_html_latest(filename, **kwargs):
    # Replace <br> with spaces so multi-line cells stay in one string
    with open(filename) as f:
        text = f.read().replace('<br>', ' ')
    df = pd.read_html(StringIO(text), **kwargs)[0]
    df.columns = df.loc[0]  # the first row holds the headers
    df = df.loc[1:]
    # Keep only the rows with the most recent 'Date Posted'
    return df.assign(d=pd.to_datetime(df['Date Posted'], format='%m/%d')) \
             .query('d == d.max()') \
             .drop(columns='d')

# file_path is defined elsewhere in the script
df = read_html_latest(file_path, attrs={'class': 'MsoNormalTable'})
print(df)

Any help with this would be greatly appreciated, thanks a lot.

HTML source:

<html> 
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 
<title>FW: NEFS 2 Available Quota 5/21</title> 
<link rel="important stylesheet" href=""> 
<style>div.headerdisplayname {font-weight:bold;}</style></head> 
<body> 
<table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part1"><tr><td><b>Subject: </b>FW: NEFS 2 Available Quota 5/21</td></tr><tr><td><b>From: </b>Claire Fitz-Gerald <[email protected]></td></tr><tr><td><b>Date: </b>5/21/2014 10:08 AM</td></tr></table><br> 
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><META HTTP-EQUIV="Content-Type" CONTENT="text/html; "><meta name=Generator content="Microsoft Word 12 (filtered medium)"><!--[if !mso]><style>v\:* {behavior:url(#default#VML);} 
o\:* {behavior:url(#default#VML);} 
w\:* {behavior:url(#default#VML);} 
.shape {behavior:url(#default#VML);} 
</style><![endif]--><style><!-- 
/* Font Definitions */ 
@font-face 
    {font-family:"Cambria Math"; 
    panose-1:2 4 5 3 5 4 6 3 2 4;} 
@font-face 
    {font-family:Calibri; 
    panose-1:2 15 5 2 2 2 4 3 2 4;} 
@font-face 
    {font-family:Tahoma; 
    panose-1:2 11 6 4 3 5 4 4 2 4;} 
@font-face 
    {font-family:"Franklin Gothic Book"; 
    panose-1:2 11 5 3 2 1 2 2 2 4;} 
@font-face 
    {font-family:"Franklin Gothic Demi"; 
    panose-1:2 11 7 3 2 1 2 2 2 4;} 
/* Style Definitions */ 
p.MsoNormal, li.MsoNormal, div.MsoNormal 
    {margin:0in; 
    margin-bottom:.0001pt; 
    font-size:11.0pt; 
    font-family:"Calibri","sans-serif";} 
a:link, span.MsoHyperlink 
    {mso-style-priority:99; 
    color:blue; 
    text-decoration:underline;} 
a:visited, span.MsoHyperlinkFollowed 
    {mso-style-priority:99; 
    color:purple; 
    text-decoration:underline;} 
span.EmailStyle17 
    {mso-style-type:personal; 
    font-family:"Calibri","sans-serif"; 
    color:windowtext;} 
span.title1 
    {mso-style-name:title1; 
    font-family:"Arial","sans-serif"; 
    color:#1F487E; 
    font-weight:normal;} 
span.EmailStyle19 
    {mso-style-type:personal-reply; 
    font-family:"Calibri","sans-serif"; 
    color:#1F497D;} 
.MsoChpDefault 
    {mso-style-type:export-only; 
    font-size:10.0pt;} 
@page WordSection1 
    {size:8.5in 11.0in; 
    margin:1.0in 1.0in 1.0in 1.0in;} 
div.WordSection1 
    {page:WordSection1;} 
--></style><!--[if gte mso 9]><xml> 
<o:shapedefaults v:ext="edit" spidmax="1026" /> 
</xml><![endif]--><!--[if gte mso 9]><xml> 
<o:shapelayout v:ext="edit"> 
<o:idmap v:ext="edit" data="1" /> 
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='color:#1F497D'>Please see the below quota listings.<o:p></o:p></span></p><p class=MsoNormal><span style='color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='color:#1F497D'>Thanks,<o:p></o:p></span></p><p class=MsoNormal><span style='color:#1F497D'><o:p>&nbsp;</o:p></span></p><div><p class=MsoNormal><span style='font-size:12.0pt;font-family:"Franklin Gothic Book","sans-serif";color:#1F497D'>Claire Fitz-Gerald<o:p></o:p></span></p><p class=MsoNormal><i><span style='font-size:10.0pt;font-family:"Franklin Gothic Book","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></i></p><p class=MsoNormal><b><span style='font-family:"Franklin Gothic Demi","sans-serif";color:#002776'>Cape Cod Commercial Fishermen's Alliance<o:p></o:p></span></b></p><p class=MsoNormal><b><span style='font-family:"Franklin Gothic Book","sans-serif";color:#DE3500'>~ Small Boats.&nbsp; Big Ideas. 
~</span></b><b><span style='color:#DE3500'><o:p></o:p></span></b></p></div><p class=MsoNormal><span style='color:#1F497D'><o:p>&nbsp;</o:p></span></p><div><div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'><p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> David Leveille [mailto:[email protected]] <br><b>Sent:</b> Wednesday, May 21, 2014 8:50 AM<br><b>To:</b> David Leveille<br><b>Subject:</b> NEFS 2 Available Quota 5/21<o:p></o:p></span></p></div></div><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal><span style='font-size:12.0pt;font-family:"Arial","sans-serif";color:#1F487E'>AVAILABLE QUOTA FY 2014</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p><table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width="71%" style='width:71.28%'><tr><td width=220 style='width:164.95pt;border:none;border-bottom:solid windowtext 1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><b><span style='font-size:9.0pt;font-family:"Arial","sans-serif";color:black'>ID <o:p></o:p></span></b></p></td><td width=161 style='width:120.75pt;border:none;border-bottom:solid windowtext 1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='mso-line-height-alt:15.0pt'><b><span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:black'>Available Quota <o:p></o:p></span></b></p></td><td width=189 style='width:141.75pt;border:none;border-bottom:solid windowtext 1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='mso-line-height-alt:15.0pt'><b><span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:black'>Live Weight Pounds <o:p></o:p></span></b></p></td><td width=126 style='width:94.55pt;border:none;border-bottom:solid windowtext 
1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='mso-line-height-alt:15.0pt'><b><span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:black'>Price <o:p></o:p></span></b></p></td><td width=168 style='width:125.95pt;border:none;border-bottom:solid windowtext 1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='mso-line-height-alt:15.0pt'><b><span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:black'>Date Posted <o:p></o:p></span></b></p></td></tr><tr><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1724<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GOM COD<br>GOM HADD<br>GOM BB<br>GREYSOLE<br>DABS<br>GOM YT<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>2328<br>445<br>3007<br>850<br>3101<br>1995<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>Package<o:p></o:p></span></p><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal style='line-height:15.0pt'><span 
style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$9,000<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/20<o:p></o:p></span></p></td></tr><tr><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1578<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GBE COD<br>GBW COD<br>GB BB<br>GB YT<br>SNE BB<br>SNE YT<br>GOM BB<br>Whake<br>POLL<br>RED<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>538<br>5894<br>1755<br>243<br>490<br>153<br>3965<br>2727<br>9227<br>15060<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$1.00<br>$0.40<br>$0.20<br>$1.00<br>$0.45<br>$0.50<br>$0.15<br>$0.20<br>$0.01<br>$0.01<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/20<o:p></o:p></span></p></td></tr><tr><td 
width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>310<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GBE COD<br>GBW COD<br>DABS<br>WHAKE<br>POLL<br>RED<br>SNE BB<br>GOM BB<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>825<br>9033<br>1241<br>3120<br>65234<br>76610<br>1688<br>1195<br>2121<br>7285<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>Package<o:p></o:p></span></p><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$15,000<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/20<o:p></o:p></span></p></td></tr><tr style='height:23.25pt'><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal 
style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>347<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>SNE BB<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>8,000<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$0.50<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/7<o:p></o:p></span></p></td></tr><tr><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1878A<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GOM COD<br>GOM HADD<br>SNE BB<br>GOM BB<br>GB BB<br>GREYSOLE<br>GOM YT<br>SNE YT<br>POLL<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 
1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>6188<br>635<br>3916<br>7873<br>6762<br>3358<br>9776<br>271<br>186550<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$1.95<br>$1.35<br>$0.50<br>$0.50<br>$0.20<br>$1.40<br>$1.20<br>$0.50<br>$0.01<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/12<o:p></o:p></span></p></td></tr><tr><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1878B<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GBE COD<br>GBW COD<br>GB YT<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1113<br>12186<br>850<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span 
style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>Package<br>$10,000<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/12<o:p></o:p></span></p></td></tr></table><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>David Leveille<o:p></o:p></p><p class=MsoNormal>II Northeast Fishery Sector Inc.<o:p></o:p></p><p class=MsoNormal>10 Witham Street<o:p></o:p></p><p class=MsoNormal>Gloucester, MA. 01930<o:p></o:p></p><p class=MsoNormal>Cell 978 375 3509<o:p></o:p></p><p class=MsoNormal>Fax 978 281 1555<o:p></o:p></p><p class=MsoNormal>Web <a href="http://nefs2.com/">http://nefs2.com/</a><o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><div class=MsoNormal align=center style='text-align:center'><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'></body></html> 
</body> 
</html> 
+1

How do you determine how many text values there will be in "Available Quota"? I see one or more text values. Also, how do you want the price from the second row to look? – Shijo

+0

Well, the Available Quota column has a finite number of species; I don't know why it prints only 7 or 8 and then puts "...". And I edited the picture to show what the prices in the second row should look like; they should all line up with their corresponding quotas – theprowler
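On the "..." in the printed frame: pandas shortens long cell text only when displaying it, controlled by the `display.max_colwidth` option (50 characters by default); the stored values stay complete. A minimal sketch, assuming pandas 1.0 or newer, where `None` means "no limit" (older versions used `-1`):

```python
import pandas as pd

long_text = "GBE COD GBW COD GB BB GB YT SNE BB SNE YT GOM BB Whake POLL RED"
df = pd.DataFrame({"Available Quota": [long_text]})

# Default display settings: the cell is cut off and ends in "..."
default_repr = repr(df)

# Temporarily lift the column width limit for display only
with pd.option_context("display.max_colwidth", None):
    full_repr = repr(df)

print(default_repr)
print(full_repr)
```

Either way, `df.loc[0, "Available Quota"]` returns the full string, so an export is not affected by the display truncation.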

+1

Now it makes more sense, thanks :) – Shijo

Answer

2

This works: the code reads through each cell, builds lists from the cell contents, and then turns those lists into a DataFrame. Note that this code only works when the number of items is the same in every cell of a row.

from bs4 import BeautifulSoup, NavigableString
import pandas as pd
import numpy as np

def celltext(cell):
    '''Return the text pieces of the cell's first <span>, split at <br> tags.'''
    span = cell.find('span')
    return [str(child) for child in span.children
            if isinstance(child, NavigableString)]

# Raw string, so the backslashes in the path are not treated as escapes
with open(r'path\to\html.html', 'r') as f:
    html = f.read()
soup = BeautifulSoup(html, 'lxml')  # Parse the HTML as a string
table = soup.find_all('table')[1]   # Grab the second table

frames = []  # one DataFrame per table row

for row in table.find_all('tr'):
    columns = row.find_all('td')
    if columns[0].get_text().strip() != 'ID':  # skip header
        Quota = celltext(columns[1])
        Weight = celltext(columns[2])
        price = celltext(columns[3])

        Nrows = max([len(Quota), len(Weight), len(price)])  # get the max number of rows

        IDList = [columns[0].get_text()] * Nrows
        DateList = [columns[4].get_text()] * Nrows

        # A package price applies to every item in the row
        if price[0].strip() == 'Package':
            price = [columns[3].get_text()] * Nrows

        if len(Quota) < len(Weight):  # if Quota has fewer items, extend it with NaN
            lstnans = [np.nan] * (len(Weight) - len(Quota))
            Quota.extend(lstnans)

        # Columns listed in alphabetical order
        FinalDataframe = pd.DataFrame(
            {
                'AvailableQuota': Quota,
                'DatePosted': DateList,
                'ID': IDList,
                'LiveWeightPounds': Weight,
                'price': price
            })
        # Collect inside the if block, so the header row adds nothing
        frames.append(FinalDataframe)

df_Quota = pd.concat(frames)
print(df_Quota)

Output

AvailableQuota DatePosted  ID LiveWeightPounds   price 
0  GOM COD  5/20 1724    2328 Package $9,000 
1  GOM HADD  5/20 1724    445 Package $9,000 
2   GOM BB  5/20 1724    3007 Package $9,000 
3  GREYSOLE  5/20 1724    850 Package $9,000 
4   DABS  5/20 1724    3101 Package $9,000 
5   GOM YT  5/20 1724    1995 Package $9,000 
0  GBE COD  5/20 1578    538   $1.00 
1  GBW COD  5/20 1578    5894   $0.40 
2   GB BB  5/20 1578    1755   $0.20 
3   GB YT  5/20 1578    243   $1.00 
4   SNE BB  5/20 1578    490   $0.45 
5   SNE YT  5/20 1578    153   $0.50 
6   GOM BB  5/20 1578    3965   $0.15 
7   Whake  5/20 1578    2727   $0.20 
8   POLL  5/20 1578    9227   $0.01 
9   RED  5/20 1578   15060   $0.01 
0  GBE COD  5/20 310    825 Package $15,000 
1  GBW COD  5/20 310    9033 Package $15,000 
2   DABS  5/20 310    1241 Package $15,000 
3   WHAKE  5/20 310    3120 Package $15,000 
4   POLL  5/20 310   65234 Package $15,000 
5   RED  5/20 310   76610 Package $15,000 
6   SNE BB  5/20 310    1688 Package $15,000 
7   GOM BB  5/20 310    1195 Package $15,000 
8   NaN  5/20 310    2121 Package $15,000 
9   NaN  5/20 310    7285 Package $15,000 
0   SNE BB  5/7 347   8,000   $0.50 
0  GOM COD  5/12 1878A    6188   $1.95 
1  GOM HADD  5/12 1878A    635   $1.35 
2   SNE BB  5/12 1878A    3916   $0.50 
3   GOM BB  5/12 1878A    7873   $0.50 
4   GB BB  5/12 1878A    6762   $0.20 
5  GREYSOLE  5/12 1878A    3358   $1.40 
6   GOM YT  5/12 1878A    9776   $1.20 
7   SNE YT  5/12 1878A    271   $0.50 
8   POLL  5/12 1878A   186550   $0.01 
0  GBE COD  5/12 1878B    1113 Package$10,000 
1  GBW COD  5/12 1878B   12186 Package$10,000 
2   GB YT  5/12 1878B    850 Package$10,000 
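
To see how the cell-splitting step works in isolation, here is a minimal sketch on a simplified stand-in for one table cell. The markup is shortened for illustration, and `html.parser` is used instead of `lxml` only so that the snippet has no extra dependency; both choices are assumptions, not part of the answer.

```python
from bs4 import BeautifulSoup, NavigableString

# Simplified stand-in for one <td> from the e-mail (style attributes dropped)
snippet = "<td><p><span>GOM COD<br>GOM HADD<br>GOM BB<o:p></o:p></span></p></td>"

cell = BeautifulSoup(snippet, "html.parser").find("td")
span = cell.find("span")

# Keep only the bare text nodes; <br> and <o:p> are Tag objects and are skipped
parts = [str(child) for child in span.children if isinstance(child, NavigableString)]
print(parts)
```

Because only the first `<span>` of a cell is inspected, a price cell that spreads "Package" and the amount over several paragraphs yields just `['Package']`, which is why the answer falls back to `get_text()` for package prices.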
+0

Wow! That's nice! – MaxU

+0

Wow, that looks absolutely flawless. Would it be easy to exclude all dates except the most recent one? In this case, keeping only the data for 5/20? – theprowler
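
One possible way to keep only the most recent date, sketched on a small stand-in for `df_Quota` (a few rows copied from the output above). The table carries no year, so one has to be supplied; 2014 is taken from the e-mail header and is an assumption of this sketch.

```python
import pandas as pd

# Small stand-in for df_Quota with a few rows from the output above
df_quota = pd.DataFrame({
    "ID": ["1724", "1578", "347", "1878A"],
    "AvailableQuota": ["GOM COD", "GBE COD", "SNE BB", "GOM COD"],
    "DatePosted": ["5/20", "5/20", "5/7", "5/12"],
})

# Append the assumed year so the strings parse as full dates
posted = pd.to_datetime(df_quota["DatePosted"] + "/2014", format="%m/%d/%Y")

# Boolean mask: keep the rows whose date equals the latest one
latest = df_quota[posted == posted.max()]
print(latest)
```

Comparing parsed dates rather than the raw strings matters here, since as text "5/7" would sort after "5/20".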

+1

In the given example the number of items differs within HTML row 3: it contains 8 items in the "Available Quota" column and 10 items in "Live Weight Pounds". Check how you want to handle those – Shijo
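
A minimal sketch of one way to handle such unequal lists, using `itertools.zip_longest` from the standard library on the values for ID 310. Padding at the end is an assumption: the source does not say which species the two extra weights belong to, so the alignment should be checked by hand.

```python
from itertools import zip_longest

# Values for ID 310 from the source table: 8 species but 10 weights
quota = ["GBE COD", "GBW COD", "DABS", "WHAKE", "POLL", "RED", "SNE BB", "GOM BB"]
weight = ["825", "9033", "1241", "3120", "65234", "76610", "1688", "1195", "2121", "7285"]

# zip_longest pads the shorter list with fillvalue instead of stopping early
rows = list(zip_longest(quota, weight, fillvalue=None))

for species, pounds in rows:
    print(species, pounds)
```

This pads whichever list is shorter, so it also covers the opposite case of more species than weights.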
