Posts in the 'All categories' category (Page 2)

18 posts in 'All categories'

2019.06.05 Pandas: drawing a heatmap from two float columns
2019.04.29 Screen width in Jupyter notebook
2019.04.22 gc (garbage collection) in Python
2019.04.17 DataFrames in Jupyter notebook
2018.05.28 Why the max value is subtracted when computing softmax
2018.04.27 [tensorflow] Installation
2018.04.20 [Python] When a Pandas DataFrame column holds multiple comma-separated values
2018.04.19 [Spark] Saving a dataframe as CSV

Pandas: drawing a heatmap from two float columns

Uncategorized 2019. 6. 5. 14:46

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris_df.head(5)

df = iris_df[["sepal_length", "sepal_width"]]
display(df.head(5))
# Round both columns to whole numbers so they can serve as discrete bins
df = df.round()
# Count the rows that fall into each (sepal_length, sepal_width) bin
df2 = df.groupby(["sepal_length", "sepal_width"]).size()
# Pivot sepal_width into columns; bins with no rows become NaN
df3 = df2.unstack()
display(df3)

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(5, 5))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(100, 10, as_cmap=True)

# Draw the heatmap with square cells
sns.heatmap(df3, cmap=cmap, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
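
The counting step (round, groupby, size, unstack) is easier to follow on a handful of rows. The sketch below uses a made-up five-row frame; the names points, x, and y are illustrative only, and fill_value=0 is an optional choice that the code above does not make (it leaves empty bins as NaN).

```python
import pandas as pd

# Hypothetical sample data: two float columns
points = pd.DataFrame({
    "x": [1.2, 1.4, 2.6, 2.9, 3.1],
    "y": [0.9, 1.1, 2.2, 3.4, 2.8],
})

# Round to whole numbers so each row lands in a discrete (x, y) bin
binned = points.round()

# Count rows per bin, then pivot y into columns to get a 2-D grid
counts = binned.groupby(["x", "y"]).size().unstack(fill_value=0)
print(counts)
```

The resulting grid has one row per distinct rounded x and one column per distinct rounded y, which is the shape that sns.heatmap expects.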

Posted by poterius

Screen width in Jupyter notebook

Uncategorized 2019. 4. 29. 16:17

To make a Jupyter notebook use the full width of the browser window:

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

Posted by poterius

gc (garbage collection) in Python

Uncategorized 2019. 4. 22. 21:09

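Python also supports garbage collection (GC).

If you run into a memory error while reading large data repeatedly, the gc module is worth considering.

It seems that variables that are no longer used have to be released explicitly (for example with del) before the collector can reclaim them.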

https://docs.python.org/3/library/gc.html
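
A minimal sketch of that idea, assuming the leftover objects are kept alive by a reference cycle: reference counting frees most objects as soon as the last name is deleted, while gc.collect() is what reclaims cycles. The Chunk class and the process function are hypothetical stand-ins for loading and using a large dataset.

```python
import gc

class Chunk:
    """Hypothetical stand-in for a large object caught in a reference cycle."""
    def __init__(self, n):
        self.values = list(range(n))
        self.owner = self  # self-reference: reference counting alone cannot free this

def process(n):
    chunk = Chunk(n)
    total = sum(chunk.values)
    del chunk  # drop our name for the object; the cycle still keeps it alive
    return total

gc.disable()   # only for this demo, so the automatic collector does not run first
gc.collect()   # start from a clean state
totals = [process(1000) for _ in range(3)]
freed = gc.collect()  # returns the number of unreachable objects it found
gc.enable()

print("totals:", totals)
print("unreachable objects found:", freed)
```

The return value of gc.collect() is a quick way to check whether unreachable cycles were piling up between iterations.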


Posted by poterius

DataFrames in Jupyter notebook

Uncategorized 2019. 4. 17. 17:29

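When a DataFrame is the last expression in a notebook cell, its contents are rendered as a nicely formatted table,

but when it sits inside a loop, no table is shown.

print(df) does show the contents, but not as a formatted table.

In that case, use IPython.display.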

import numpy as np
import pandas as pd
from IPython.display import display
from sklearn.datasets import load_iris

iris = load_iris()

# Combine the feature matrix and the target vector into one DataFrame
data = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['target'])
target = data['target'].unique()

# display() renders each DataFrame as a table, even inside a loop
for t in target:
    display(data[data['target'] == t].head(2))

Posted by poterius

Why the max value is subtracted when computing softmax

Uncategorized 2018. 5. 28. 13:30

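Reference (original article): https://jamesmccaffrey.wordpress.com/2016/03/04/the-max-trick-when-computing-softmax/

The softmax formula is

softmax(x_i) = exp(x_i) / sum_j exp(x_j)

but actual implementations subtract the maximum value before exponentiating:

softmax(x_i) = exp(x_i - max(x)) / sum_j exp(x_j - max(x))

The two forms are mathematically identical, because the common factor exp(-max(x)) cancels between numerator and denominator. The reason for the subtraction is that e is greater than 1, so exp(x) grows extremely fast; in float64 it overflows to inf once x exceeds roughly 709, and inf / inf evaluates to nan. After the shift every exponent is at most 0, so no term can overflow.

Results

Without subtracting the max, the output is already nan when a value is only 2000.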

x : [1. 1. 2.]
softmax with -max :  [0.21194156 0.21194156 0.57611688] 1.0
softmax without -max :  [0.21194156 0.21194156 0.57611688] 1.0
x : [ 1.  1. 20.]
softmax with -max :  [5.60279637e-09 5.60279637e-09 9.99999989e-01] 1.0
softmax without -max :  [5.60279637e-09 5.60279637e-09 9.99999989e-01] 1.0
x : [  1.   1. 200.]
softmax with -max :  [3.76182078e-87 3.76182078e-87 1.00000000e+00] 1.0
softmax without -max :  [3.76182078e-87 3.76182078e-87 1.00000000e+00] 1.0
x : [1.e+00 1.e+00 2.e+03]
softmax with -max :  [0. 0. 1.] 1.0
softmax without -max :  [ 0.  0. nan] nan
x : [1.e+00 1.e+00 2.e+04]
softmax with -max :  [0. 0. 1.] 1.0
softmax without -max :  [ 0.  0. nan] nan
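
The script that printed these results is not included in the post. The following is a minimal NumPy sketch that should reproduce the same comparison; the function names softmax_stable and softmax_naive are chosen here for illustration and do not come from the referenced article.

```python
import numpy as np

def softmax_stable(x):
    """Softmax with the max subtracted before exponentiating."""
    shifted = x - np.max(x)      # the largest exponent becomes 0
    e = np.exp(shifted)
    return e / e.sum()

def softmax_naive(x):
    """Softmax straight from the definition; overflows for large inputs."""
    # Silence the overflow / invalid-value warnings so only the result is shown
    with np.errstate(over="ignore", invalid="ignore"):
        e = np.exp(x)            # exp(2000) overflows to inf in float64
        return e / e.sum()       # inf / inf evaluates to nan

for last in (2.0, 20.0, 200.0, 2000.0, 20000.0):
    x = np.array([1.0, 1.0, last])
    with_max = softmax_stable(x)
    without_max = softmax_naive(x)
    print("x :", x)
    print("softmax with -max : ", with_max, with_max.sum())
    print("softmax without -max : ", without_max, without_max.sum())
```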

Posted by poterius

[tensorflow] Installation

Uncategorized 2018. 4. 27. 15:31

References

TensorFlow site: https://www.tensorflow.org/install/

Installing with Docker: https://hub.docker.com/r/tensorflow/tensorflow/

To use Python 3:

docker pull tensorflow/tensorflow:latest-py3

Posted by poterius

[Python] When a Pandas DataFrame column holds multiple comma-separated values

Uncategorized 2018. 4. 20. 20:34

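The docs column holds several comma-separated values.

Treating it as an array, suppose we want to know the position at which a particular value appears.

df['docs'] = df['docs'].str.split(",")
df['pos'] = df['docs'].apply(lambda x: x.index(TARGET) if TARGET in x else -1)

The type of TARGET, the value being searched for, must match the type of the list elements produced by split().

The list elements are str, so if TARGET is an int the value is never found even when it is present, and the lookup also gets slower.

With fields made up of numbers, this kind of mismatch can easily turn into a hard-to-find bug.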
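
The type mismatch can be reproduced on a tiny frame. This is a sketch with made-up data; the helper name position and the column names pos_int and pos_str are illustrative. Converting the target with str() before the lookup is one way to keep the types aligned.

```python
import pandas as pd

# Hypothetical sample data: comma-separated numeric IDs stored as text
df = pd.DataFrame({"docs": ["10,20,30", "40,50", "20"]})
df["docs"] = df["docs"].str.split(",")   # every element is now a str

TARGET = 20  # an int, which never equals the str "20"

def position(values, target):
    """Return the index of target in values, or -1; target is compared as str."""
    target = str(target)
    return values.index(target) if target in values else -1

# int target against str elements: never found
df["pos_int"] = df["docs"].apply(lambda x: x.index(TARGET) if TARGET in x else -1)
# target converted to str first: found where present
df["pos_str"] = df["docs"].apply(lambda x: position(x, TARGET))

print(df)
```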

Posted by poterius

[Spark] Saving a dataframe as CSV

Uncategorized 2018. 4. 19. 17:14

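Saving the contents of a dataframe to a CSV file in Spark

import org.apache.spark.sql.SaveMode

val select_SQL = s"""
    select dt
        , query
     ...
  """

logger.info("select SQL : " + select_SQL)
val df = spark.sql(select_SQL)

logger.info("#row : " + df.count())
// coalesce(1) merges all partitions so that a single CSV part file is written
df.coalesce(1)
  .write.mode(SaveMode.Overwrite)
  .option("header", "true")
  .format("com.databricks.spark.csv")
  .save("output_folder")

Even when this is run from spark-shell, the output is created on HDFS (/home/{user}/output_folder). Note that output_folder is a directory that holds the part file, not a single CSV file.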

References

http://americanopeople.tistory.com/93

https://stackoverflow.com/questions/49102292/file-already-exists-error-writing-new-files-from-dataframe

Posted by poterius
