[Python] Pandas Tutorial :: groupby

슈퍼짱짱 2020. 10. 7. 16:40

2020/09/18 - [Python/Pandas Tutorial] - [Python] Pandas Tutorial :: What is pandas? What is a DataFrame? What is a Series?

2020/09/19 - [Python/Pandas Tutorial] - [Python] Pandas Tutorial :: read csv, txt file with pandas

2020/09/21 - [Python/Pandas Tutorial] - [Python] Pandas Tutorial :: Create Data Frame with Dictionary, List

2020/09/21 - [Python/Pandas Tutorial] - [Python] Pandas Tutorial :: Save Pandas Data Frame to CSV file

2020/09/21 - [Python/Pandas Tutorial] - [Python] Pandas Tutorial :: Filtering(Selecting) rows, columns in pandas DataFrame

2020/09/22 - [Python/Pandas Tutorial] - [Python] Pandas Tutorial :: Drop row or column in pandas DataFrame

2020/09/22 - [Python/Pandas Tutorial] - [Python] Pandas Tutorial :: add row, column

Aggregating with pandas groupby

The examples use the iris dataset.

First, import pandas and load the practice data.

# 0. import pandas library
import pandas as pd

# 1. load iris data
data = pd.read_csv("01. Data/iris.csv")

> data.head()

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa

For my own convenience, I renamed the variety column to Species (to match the column name of the iris dataset in R).

data = data.rename(columns = {"variety":"Species"})

The type of each column is as follows.

> data.dtypes

sepal.length    float64
sepal.width     float64
petal.length    float64
petal.width     float64
Species          object
dtype: object

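The first four columns from the left are numeric; only the last column, Species, holds strings (object dtype).

Note that Species takes the values 'Setosa', 'Versicolor', and 'Virginica' (you can check this with data.Species.unique()).

data.describe() gives a quick summary of the distribution of the numerical columns. It plays a role similar to summary in R.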

> data.describe()

       sepal.length  sepal.width  petal.length  petal.width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

To count the number of rows for each Species (the type of flower):

> data.groupby("Species").size()

Species
Setosa        50
Versicolor    50
Virginica     50
dtype: int64
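As a small side note on counting, the sketch below contrasts size() with count() on a made-up toy frame (not the iris file; the column name value and its contents are assumptions for illustration). size() counts rows per group, while count() counts only non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the iris file): one missing value in "value"
toy = pd.DataFrame({
    "Species": ["Setosa", "Setosa", "Versicolor", "Virginica", "Virginica"],
    "value": [5.1, np.nan, 7.0, 6.3, 5.8],
})

# size() counts rows per group, including rows with missing values
sizes = toy.groupby("Species").size()

# count() counts only the non-missing values of the selected column
counts = toy.groupby("Species")["value"].count()

print(sizes)
print(counts)
```

With the iris data there are no missing values, so both would give 50 per species.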

To print each group's name, row count, and first two rows in a readable form:

groupby_Species = data.groupby('Species')

for name, group in groupby_Species:
    print(name + " : " + str(len(group)))  # number of rows in the group
    print(group.head(2))  # first 2 rows of the group
    print()  # blank line between groups

Setosa : 50
   sepal.length  sepal.width  petal.length  petal.width Species
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa

Versicolor : 50
    sepal.length  sepal.width  petal.length  petal.width     Species
50           7.0          3.2           4.7          1.4  Versicolor
51           6.4          3.2           4.5          1.5  Versicolor

Virginica : 50
     sepal.length  sepal.width  petal.length  petal.width    Species
100           6.3          3.3           6.0          2.5  Virginica
101           5.8          2.7           5.1          1.9  Virginica
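If only one group is needed, looping is not required. The sketch below uses a made-up toy frame (the values are assumptions, not the iris file) to show that iterating yields (name, sub-frame) pairs in sorted key order, and that get_group() pulls out a single group directly.

```python
import pandas as pd

# Hypothetical toy data standing in for the iris frame
toy = pd.DataFrame({
    "Species": ["Virginica", "Setosa", "Versicolor", "Setosa"],
    "sepal.length": [6.3, 5.1, 7.0, 4.9],
})
grouped = toy.groupby("Species")

# Iterating yields (group name, sub-DataFrame); keys are sorted by default
names = [name for name, _ in grouped]

# get_group() returns the rows of one group as a DataFrame
setosa = grouped.get_group("Setosa")

print(names)
print(setosa)
```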

Next, here is how to aggregate sepal.length by Species.

The mean is computed as follows.

data.groupby("Species")["sepal.length"].mean()

Species
Setosa        5.006
Versicolor    5.936
Virginica     6.588
Name: sepal.length, dtype: float64

For the sum, minimum, or maximum instead of the mean, use sum(), min(), or max() in place of mean(). Just call whichever aggregation function you need.
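When several of these statistics are needed at once, agg() accepts a list of function names. This is a minimal sketch on made-up toy data (the values are assumptions, not the iris file).

```python
import pandas as pd

# Hypothetical toy data standing in for the iris frame
toy = pd.DataFrame({
    "Species": ["Setosa", "Setosa", "Versicolor", "Versicolor", "Virginica"],
    "sepal.length": [5.0, 5.5, 6.0, 7.0, 6.5],
})

# One column per requested statistic, one row per group
summary = toy.groupby("Species")["sepal.length"].agg(["mean", "sum", "min", "max"])

print(summary)
```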

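You can also apply a function you define yourself: pass the function itself (not a quoted name) to apply(), as in apply(FUNCTION).

For example, suppose we want to apply a function named fun that returns (maximum - minimum). It is applied as follows.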

def fun(x):
    # x is the Series of values belonging to one group
    return x.max() - x.min()

data.groupby("Species")["sepal.length"].apply(fun)

Species
Setosa        1.5
Versicolor    2.1
Virginica     3.0
Name: sepal.length, dtype: float64
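For a function that reduces each group to a single number, agg() can be used in the same way as apply(). The sketch below checks this on made-up toy data; the name value_range and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the iris frame
toy = pd.DataFrame({
    "Species": ["Setosa", "Setosa", "Versicolor", "Versicolor", "Virginica"],
    "sepal.length": [5.0, 5.5, 6.0, 7.0, 6.5],
})

def value_range(x):
    # x is the Series of values belonging to one group
    return x.max() - x.min()

by_apply = toy.groupby("Species")["sepal.length"].apply(value_range)
by_agg = toy.groupby("Species")["sepal.length"].agg(value_range)

print(by_apply)
print(by_agg)
```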

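To aggregate every numeric variable by group, rather than one specific variable, do the following.

In the pandas version used for this post, non-numeric columns are left out of the result. In pandas 2.0 and later, mean() no longer drops them silently, so pass numeric_only=True if the data has non-numeric columns other than the grouping key.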

data.groupby(data.Species).mean()

            sepal.length  sepal.width  petal.length  petal.width
Species
Setosa             5.006        3.428         1.462        0.246
Versicolor         5.936        2.770         4.260        1.326
Virginica          6.588        2.974         5.552        2.026
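To make the handling of non-numeric columns explicit regardless of the pandas version, numeric_only=True can be passed to mean(). The sketch below uses made-up toy data with an extra text column named note, which is an assumption for illustration and not part of the iris file.

```python
import pandas as pd

# Hypothetical toy data; "note" is a non-numeric column that is not a group key
toy = pd.DataFrame({
    "Species": ["Setosa", "Setosa", "Versicolor", "Versicolor"],
    "sepal.length": [5.0, 5.5, 6.0, 7.0],
    "note": ["a", "b", "c", "d"],
})

# numeric_only=True restricts the aggregation to numeric columns
means = toy.groupby("Species").mean(numeric_only=True)

print(means)
```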

Grouping by two variables works as follows.

First, create a column named sepal.width2 that is "low" when sepal.width is less than 3 and "high" otherwise (3 or greater).

import numpy as np
data['sepal.width2'] = np.where(data['sepal.width'] < 3 , "low", "high")

> data.head(2)

   sepal.length  sepal.width  petal.length  petal.width Species sepal.width2
0           5.1          3.5           1.4          0.2  Setosa         high
1           4.9          3.0           1.4          0.2  Setosa         high
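The boundary case is worth a quick check: row 1 above has sepal.width 3.0 and is labeled high, because the condition is a strict "less than 3". A minimal sketch with three made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical values around the threshold of 3
widths = pd.Series([2.9, 3.0, 3.1], name="sepal.width")

# np.where picks "low" where the condition holds and "high" everywhere else
labels = np.where(widths < 3, "low", "high")

print(labels.tolist())
```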

For reference, the resulting counts of high and low are shown below. value_counts() returns the same kind of result as table() in R.

> data['sepal.width2'].value_counts()

high    93
low     57

Now aggregate by both Species and sepal.width2 as follows.

> data.groupby(["Species","sepal.width2"]).mean()

                         sepal.length  sepal.width  petal.length  petal.width
Species    sepal.width2
Setosa     high              5.029167     3.462500      1.466667     0.245833
           low               4.450000     2.600000      1.350000     0.250000
Versicolor high              6.218750     3.100000      4.543750     1.487500
           low               5.802941     2.614706      4.126471     1.250000
Virginica  high              6.768966     3.182759      5.641379     2.127586
           low               6.338095     2.685714      5.428571     1.885714

unstack() moves the inner index level (sepal.width2) into the columns, which gives a result that is easier to read.

> data.groupby(["Species","sepal.width2"]).mean().unstack()

              sepal.length            sepal.width            petal.length  \
sepal.width2          high       low         high       low          high
Species
Setosa            5.029167  4.450000     3.462500  2.600000      1.466667
Versicolor        6.218750  5.802941     3.100000  2.614706      4.543750
Virginica         6.768966  6.338095     3.182759  2.685714      5.641379

                        petal.width
sepal.width2       low         high       low
Species
Setosa        1.350000     0.245833  0.250000
Versicolor    4.126471     1.487500  1.250000
Virginica     5.428571     2.127586  1.885714
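The wide table above wraps because every numeric column gets a high/low pair. Selecting a single column before unstack() gives a compact Species by sepal.width2 table. This is a minimal sketch on made-up toy data (the values are assumptions, not the iris file).

```python
import pandas as pd

# Hypothetical toy data standing in for the iris frame
toy = pd.DataFrame({
    "Species": ["Setosa", "Setosa", "Setosa",
                "Versicolor", "Versicolor", "Versicolor"],
    "sepal.width2": ["high", "high", "low", "high", "low", "low"],
    "sepal.length": [5.0, 5.5, 4.5, 6.0, 5.5, 6.0],
})

# Group by both keys, aggregate one column, then pivot the inner key to columns
wide = (
    toy.groupby(["Species", "sepal.width2"])["sepal.length"]
    .mean()
    .unstack()
)

print(wide)
```

Rows are indexed by Species and the columns are the high and low labels, so a single cell can be read with wide.loc["Setosa", "high"].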
