Gradually, Then Suddenly

Reading ORC and Parquet files stored in S3 with pandas

Uncategorized 2019. 6. 17. 14:11

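When the data is not too large, analyzing it with pandas + jupyter is often the most convenient option.

However, data pipelines usually run on servers, and on AWS the output most likely ends up in S3, so downloading the files by hand every time you work on them is tedious. The code in this post reads the files straight from S3 into a DataFrame instead.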

Preparation

Grant the instance running jupyter an IAM role that is allowed to read the target S3 bucket.
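As a rough illustration of what "able to read the target bucket" usually means, the sketch below builds a minimal read-only policy document. The helper name build_read_only_policy is hypothetical and the bucket name is the same placeholder used in the post; the loop later in this post lists keys under a prefix and then downloads each object, which maps to s3:ListBucket on the bucket and s3:GetObject on its objects. Your account may need a narrower or different policy.

```python
import json


def build_read_only_policy(bucket_name):
    """Return a minimal read-only S3 policy document for one bucket (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing keys under a prefix is a bucket-level action.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Downloading object bodies is an object-level action.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


# 'bucket_name' is a placeholder, not a real bucket.
policy_json = json.dumps(build_read_only_policy("bucket_name"), indent=2)
print(policy_json)
```

The printed JSON can be attached to the instance role as an inline policy in the IAM console.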

import boto3
import pyarrow
import pyarrow.parquet as parquet
import pyarrow.orc as orc
import pandas as pd

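def read_surprise_table(obj):
    """Read one ORC object from S3 into a pandas DataFrame."""
    # Download the whole object into memory and wrap it so pyarrow can read it like a file
    buf = obj.get()['Body'].read()
    reader = pyarrow.BufferReader(buf)
    data = orc.ORCFile(reader)
    df = data.read().to_pandas()

    print("df.columns: ", df.columns)
    print("df.shape: ", df.shape)

    return df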

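from datetime import date

s3 = boto3.resource('s3')
bucket = 'bucket_name'   # bucket name
s3bucket = s3.Bucket(bucket)
merged_df = pd.DataFrame()
# Assumption: the dt= partition is named after today's date as YYYY-MM-DD; adjust to your layout.
today_str = date.today().strftime('%Y-%m-%d')
path = 'folder1/folder2/dt=' + today_str + '/'  # object path, excluding the bucket name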

# If the data is split across several files, read each one in the loop and concat them into a single dataframe.
for obj in s3bucket.objects.filter(Delimiter='/', Prefix=path):
    print("bucket:{}, key:{}".format(obj.bucket_name, obj.key))
    if obj.key.endswith("_SUCCESS"):  # skip the _SUCCESS marker, which is not a data file
        continue
    tmp = read_surprise_table(obj)
    if merged_df.shape[0] == 0:
        merged_df = tmp
    else:
        merged_df = pd.concat([merged_df, tmp], axis=0)

print("merged_df.columns: ", merged_df.columns)
print("merged_df.shape: ", merged_df.shape)

Posted by poterius


Pandas: drawing a heatmap from two float columns

Uncategorized 2019. 6. 5. 14:46

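import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from sklearn.datasets import load_iris

# Assumption: iris is scikit-learn's bundled iris dataset (it exposes .data and .feature_names).
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris_df.head(5)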

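df = iris_df[["sepal_length", "sepal_width"]]
display(df.head(5))
# Round to whole numbers so the two float columns fall into discrete bins
df = df.round()
# Count rows per (sepal_length, sepal_width) pair, then pivot sepal_width into columns
df2 = df.groupby(["sepal_length", "sepal_width"]).size()
df3 = df2.unstack()
display(df3)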

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(5, 5))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(100, 10, as_cmap=True)

# Draw the heatmap with square cells (no mask is applied; empty bins show up as NaN)
sns.heatmap(df3, cmap=cmap, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
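The groupby + size + unstack step leaves NaN wherever a (sepal_length, sepal_width) pair never occurs. If zeros are preferable to blanks in the heatmap, pd.crosstab is one alternative; the sketch below compares the two on a tiny made-up sample instead of the iris data.

```python
import pandas as pd

# Tiny made-up sample standing in for the two iris columns.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.6, 6.2],
    "sepal_width": [3.6, 3.0, 2.9, 3.4],
})
rounded = sample.round()

# Same approach as in the post: bins that never occur become NaN.
via_groupby = rounded.groupby(["sepal_length", "sepal_width"]).size().unstack()

# crosstab counts the same pairs but fills bins that never occur with 0.
via_crosstab = pd.crosstab(rounded["sepal_length"], rounded["sepal_width"])

print(via_groupby)
print(via_crosstab)
```

Either table can be passed to sns.heatmap; the crosstab version colors empty bins as zero instead of leaving them blank.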

Posted by poterius


Screen width in jupyter notebook

Uncategorized 2019. 4. 29. 16:17

To make a jupyter notebook use the full width of the screen:

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

Posted by poterius

