[dataload] A summary of the functions in sklearn's datasets module

2015. 1. 30. 17:28

 To run classification with a decision tree, the data has to be preprocessed and passed to the tree. The snippet below is the decision tree example given on the http://scikit-learn.org/stable/modules/tree.html page. In this post I want to look at how the load_iris() function prepares the data that is passed to the tree.
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)
def load_iris():

    """Load and return the iris dataset (classification).
    The iris dataset is a classic and very easy multi-class classification
    dataset.
    =================   ==============
    Classes                          3
    Samples per class               50
    Samples total                  150
    Dimensionality                   4
    Features            real, positive
    =================   ==============
    Returns
    -------
    data : Bunch
        Dictionary-like object, the interesting attributes are:
        'data', the data to learn, 'target', the classification labels,
        'target_names', the meaning of the labels, 'feature_names', the
        meaning of the features, and 'DESCR', the
        full description of the dataset.

    Examples
    --------
    Let's say you are interested in the samples 10, 25, and 50, and want to
    know their class name.
    >>> from sklearn.datasets import load_iris
    >>> data = load_iris()
    >>> data.target[[10, 25, 50]]
    array([0, 0, 1])
    >>> list(data.target_names)
    ['setosa', 'versicolor', 'virginica']
    """
    module_path = dirname(__file__)
    with open(join(module_path, 'data', 'iris.csv')) as csv_file:
        data_file = csv.reader(csv_file)
        temp = next(data_file)
        n_samples = int(temp[0])
        n_features = int(temp[1])
        target_names = np.array(temp[2:])
        data = np.empty((n_samples, n_features))
        target = np.empty((n_samples,), dtype=int)
#########################################
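 This part opens the .csv file with a with ... as statement. The join function will be covered another time. Here, next reads the first line of the opened file: it advances the csv.reader iterator by one row and returns that row as a list of strings. As the code above shows, the first and second values of that line are stored in n_samples and n_features. For reference, the first line of iris.csv is 150,4,setosa,versicolor,virginica. It holds the number of samples (150), the feature dimension (4), and the class names. numpy (imported as np) is then used to allocate the data [150 x 4] and target [150] arrays; np.empty leaves them uninitialized, and they are filled in afterwards.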
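
 To see the header step in isolation, here is a minimal sketch that parses the same kind of first line from an in-memory string rather than the real iris.csv. The io.StringIO stand-in and the plain Python list for target_names (instead of a NumPy array) are assumptions made so that the snippet runs with the standard library only.

```python
import csv
import io

# Stand-in for the opened iris.csv: the header line quoted above,
# followed by the first data row (4 feature values, then the class index).
csv_text = "150,4,setosa,versicolor,virginica\n5.1,3.5,1.4,0.2,0\n"

with io.StringIO(csv_text) as csv_file:
    data_file = csv.reader(csv_file)

    # next() consumes the first row and returns it as a list of strings.
    temp = next(data_file)
    n_samples = int(temp[0])    # number of samples
    n_features = int(temp[1])   # number of features per sample
    target_names = temp[2:]     # remaining fields are the class names

    # The reader is now positioned on the first data row.
    first_row = next(data_file)

print(n_samples, n_features, target_names)
print(first_row)
```

 Every field comes back as a string, which is why the original code wraps the first two values in int() before using them as array sizes.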
#########################################
        for i, ir in enumerate(data_file):
            data[i] = np.asarray(ir[:-1], dtype=float)
            target[i] = np.asarray(ir[-1], dtype=int)

    with open(join(module_path, 'descr', 'iris.rst')) as rst_file:
        fdescr = rst_file.read()

    return Bunch(data=data, target=target,
                 target_names=target_names,
                 DESCR=fdescr,
                 feature_names=['sepal length (cm)', 'sepal width (cm)',
                                'petal length (cm)', 'petal width (cm)'])
#########################################
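 enumerate is used to fill the data and target arrays row by row: every column except the last goes into data, and the last column (the class index) goes into target. Because next(data_file) was already called above, the very first line of the file is not part of this loop, so i = 0 corresponds to the first sample. Finally, the result is returned as a Bunch.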
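
 The loop and the Bunch-style return value can be tried out on a tiny input. The following is only a sketch under a few assumptions: SimpleBunch and load_small are hypothetical names and not part of sklearn, plain lists stand in for the NumPy arrays, and the three-row input is a shortened, made-up file whose rows are taken from the iris data (one per class).

```python
import csv
import io


class SimpleBunch(dict):
    """Hypothetical stand-in for Bunch: a dict whose keys are also attributes."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None


def load_small(csv_text):
    """Parse a header line plus data rows, following the steps of load_iris()."""
    data_file = csv.reader(io.StringIO(csv_text))
    temp = next(data_file)            # header row is consumed here
    n_samples = int(temp[0])
    n_features = int(temp[1])
    target_names = temp[2:]

    # Pre-allocate the containers (plain lists stand in for np.empty).
    data = [[0.0] * n_features for _ in range(n_samples)]
    target = [0] * n_samples

    # enumerate() yields i = 0 for the first data row, because the
    # header was already taken off the iterator by next().
    for i, ir in enumerate(data_file):
        data[i] = [float(v) for v in ir[:-1]]   # all columns but the last
        target[i] = int(ir[-1])                 # last column: class index

    return SimpleBunch(data=data, target=target, target_names=target_names)


# Shortened, hypothetical file: a header line followed by three rows.
sample = (
    "3,4,setosa,versicolor,virginica\n"
    "5.1,3.5,1.4,0.2,0\n"
    "7.0,3.2,4.7,1.4,1\n"
    "6.3,3.3,6.0,2.5,2\n"
)
small = load_small(sample)
print(small.data[0], small.target, small.target_names[small.target[2]])
```

 Attribute access (small.data) and key access (small["data"]) return the same object here, which mirrors how iris.data and iris.target are used in the decision tree example at the top of the post.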
