2015-09-16 2 views
2

Select records by group condition

I am new to Python (using Anaconda with Python v3.4.3) and could not find this answer anywhere, but it seems so basic that I must be going about it the wrong way.

import pandas as pd 
url = 'https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv' 
tips = pd.read_csv(url) 
tips.head(5) 
Out[1]: 
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

I would like to select the records whose day group has at least 50 records.

sel_days = tips.groupby("day").size() > 50 
sel_days 
Out[2]: 
day 
Fri  False 
Sat  True 
Sun  True 
Thur  True 
dtype: bool 

I can see that this is a Series, but I can't figure out how to generate a boolean sequence for selecting rows from the original tips dataset.

type(sel_days) 
Out[3]: pandas.core.series.Series 
print(x in sel_days for x in tips["day"]) 
<generator object <genexpr> at 0x0000000007DBDFC0> 
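For context, the attempt above prints a generator object because a generator expression is lazy, and the in operator on a Series tests the index labels (here, the day names), not the stored boolean values. A minimal sketch of both effects, using a made-up two-entry Series named flags as a stand-in for sel_days:

```python
import pandas as pd

# Hypothetical flags standing in for sel_days (not the real tips data).
flags = pd.Series([False, True], index=["Fri", "Sat"])

# `in` on a Series checks the index labels, so this is True even though
# the value stored under "Fri" is False.
print("Fri" in flags)

# A generator expression is lazy; wrap it in list() to see its values.
gen = (day in flags for day in ["Fri", "Sat", "Mon"])
print(list(gen))
```

So even when materialized, this expression only reports whether each day appears as a label, which is not the per-row mask the question is after.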

How would I do this?

Answer

4

You want filter:

In [22]: 
tips.groupby('day').filter(lambda x: len(x) > 50) 

Out[22]: 
    total_bill tip  sex smoker day time size 
0   16.99 1.01 Female  No Sun Dinner  2 
1   10.34 1.66 Male  No Sun Dinner  3 
2   21.01 3.50 Male  No Sun Dinner  3 
3   23.68 3.31 Male  No Sun Dinner  2 
4   24.59 3.61 Female  No Sun Dinner  4 
5   25.29 4.71 Male  No Sun Dinner  4 
6   8.77 2.00 Male  No Sun Dinner  2 
7   26.88 3.12 Male  No Sun Dinner  4 
8   15.04 1.96 Male  No Sun Dinner  2 
9   14.78 3.23 Male  No Sun Dinner  2 
10  10.27 1.71 Male  No Sun Dinner  2 
11  35.26 5.00 Female  No Sun Dinner  4 
12  15.42 1.57 Male  No Sun Dinner  2 
13  18.43 3.00 Male  No Sun Dinner  4 
14  14.83 3.02 Female  No Sun Dinner  2 
15  21.58 3.92 Male  No Sun Dinner  2 
16  10.33 1.67 Female  No Sun Dinner  3 
17  16.29 3.71 Male  No Sun Dinner  3 
18  16.97 3.50 Female  No Sun Dinner  3 
19  20.65 3.35 Male  No Sat Dinner  3 
20  17.92 4.08 Male  No Sat Dinner  2 
21  20.29 2.75 Female  No Sat Dinner  2 
22  15.77 2.23 Female  No Sat Dinner  2 
23  39.42 7.58 Male  No Sat Dinner  4 
24  19.82 3.18 Male  No Sat Dinner  2 
25  17.81 2.34 Male  No Sat Dinner  4 
26  13.37 2.00 Male  No Sat Dinner  2 
27  12.69 2.00 Male  No Sat Dinner  2 
28  21.70 4.30 Male  No Sat Dinner  2 
29  19.65 3.00 Female  No Sat Dinner  2 
..   ... ...  ... ... ...  ... ... 
207  38.73 3.00 Male Yes Sat Dinner  4 
208  24.27 2.03 Male Yes Sat Dinner  2 
209  12.76 2.23 Female Yes Sat Dinner  2 
210  30.06 2.00 Male Yes Sat Dinner  3 
211  25.89 5.16 Male Yes Sat Dinner  4 
212  48.33 9.00 Male  No Sat Dinner  4 
213  13.27 2.50 Female Yes Sat Dinner  2 
214  28.17 6.50 Female Yes Sat Dinner  3 
215  12.90 1.10 Female Yes Sat Dinner  2 
216  28.15 3.00 Male Yes Sat Dinner  5 
217  11.59 1.50 Male Yes Sat Dinner  2 
218  7.74 1.44 Male Yes Sat Dinner  2 
219  30.14 3.09 Female Yes Sat Dinner  4 
227  20.45 3.00 Male  No Sat Dinner  4 
228  13.28 2.72 Male  No Sat Dinner  2 
229  22.12 2.88 Female Yes Sat Dinner  2 
230  24.01 2.00 Male Yes Sat Dinner  4 
231  15.69 3.00 Male Yes Sat Dinner  3 
232  11.61 3.39 Male  No Sat Dinner  2 
233  10.77 1.47 Male  No Sat Dinner  2 
234  15.53 3.00 Male Yes Sat Dinner  2 
235  10.07 1.25 Male  No Sat Dinner  2 
236  12.60 1.00 Male Yes Sat Dinner  2 
237  32.83 1.17 Male Yes Sat Dinner  2 
238  35.83 4.67 Female  No Sat Dinner  3 
239  29.03 5.92 Male  No Sat Dinner  3 
240  27.18 2.00 Female Yes Sat Dinner  2 
241  22.67 2.00 Male Yes Sat Dinner  2 
242  17.82 1.75 Male  No Sat Dinner  2 
243  18.78 3.00 Female  No Thur Dinner  2 

[225 rows x 7 columns] 
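The same call can be tried without downloading the CSV. The sketch below uses a small made-up frame; the data, the threshold of 2, and the names df and kept are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the tips data: only the 'day' column matters here.
df = pd.DataFrame({
    "day": ["Sun", "Sun", "Sun", "Sat", "Sat", "Sat", "Fri"],
    "tip": [1.01, 1.66, 3.50, 3.35, 4.08, 2.75, 1.00],
})

# filter() calls the function once per group and keeps every row of the
# groups for which it returns True; the original row order is preserved.
kept = df.groupby("day").filter(lambda g: len(g) > 2)

print(kept["day"].unique())
print(len(kept))
```

If a group with exactly the threshold number of rows should also be kept ("at least 50" in the question), the comparison would be >= rather than >.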
0

I would add a new column to the tips dataframe by mapping the boolean mask onto it:

tips['mask'] = tips['day'].map(sel_days) 

and then select only the True values:

tips = tips[tips['mask']] 
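A variation on this idea, sketched on made-up data, builds the row mask without storing a helper column. It also shows transform("size") as an alternative that is not part of the answer above; the frame, the threshold of 2, and all variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the tips data.
df = pd.DataFrame({
    "day": ["Sun", "Sun", "Sun", "Sat", "Sat", "Sat", "Fri"],
    "tip": [1.01, 1.66, 3.50, 3.35, 4.08, 2.75, 1.00],
})

# Per-group boolean flags indexed by day (same shape as sel_days in the question).
sel = df.groupby("day").size() > 2

# map() looks up each row's day in sel, giving one boolean per row.
row_mask = df["day"].map(sel)
by_map = df[row_mask]

# Alternative: transform("size") broadcasts each group's size back onto its rows.
by_transform = df[df.groupby("day")["day"].transform("size") > 2]

print(by_map.equals(by_transform))
```

Both selections leave the original frame unchanged, which may be preferable when the mask column is not needed afterwards.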