
Which machine learning classifier to choose, in general?




Suppose I'm working on a classification problem. (Fraud detection and comment spam are the two problems I'm working on right now, but I'm curious about classification tasks in general.)

How do I know which classifier I should use?

  1. Decision tree
  2. SVM
  3. Bayesian
  4. Neural network
  5. K-nearest neighbors
  6. Q-learning
  7. Genetic algorithm
  8. Markov decision processes
  9. Convolutional neural networks
  10. Linear regression or logistic regression
  11. Boosting, bagging, ensembling
  12. Random hill climbing or simulated annealing
  13. ...

In which cases is one of these the "natural" first choice, and what are the principles for choosing it?

Examples of the type of answer I'm looking for (from Manning et al.'s Introduction to Information Retrieval book):

a. If your data is labeled but you only have a limited amount, you should use a classifier with high bias (for example, Naive Bayes).

This is because a higher-bias classifier will have lower variance, which is good given the small amount of data.

b. If you have a ton of data, the choice of classifier doesn't really matter much, so you should probably just choose a classifier with good scalability.

  1. What other guidelines are there? Even answers like "if you have to explain your model to an upper-management person, you should probably use a decision tree, since the decision rules are fairly transparent" are good. I care less about implementation/library issues, though.

  2. Also, as a somewhat separate question: besides standard Bayesian classifiers, are there "standard, state-of-the-art" methods for comment spam detection (as opposed to email spam)?


[Figure: scikit-learn machine learning algorithm cheat-sheet]

First of all, you need to identify your problem. It depends on what kind of data you have and what your desired task is.

If you are Predicting Category:

  • You have Labeled Data
    • You need to follow the Classification Approach and its algorithms
  • You don't have Labeled Data
    • You need to go with the Clustering Approach

If you are Predicting Quantity:

  • You need to go with the Regression Approach

Otherwise

  • You can go with the Dimensionality Reduction Approach

There are different algorithms within each approach mentioned above. The choice of a particular algorithm also depends on the size of the dataset.

Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/
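
A minimal scikit-learn sketch of this decision flow; the iris data and the estimator chosen for each branch are purely illustrative stand-ins, not the only options.

    # Illustrative mapping from the decision flow above to scikit-learn estimators.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression, LinearRegression
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    X, y = load_iris(return_X_y=True)

    # Predicting a category with labeled data -> classification
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Predicting a category without labels -> clustering
    clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)

    # Predicting a quantity -> regression (here: predict one numeric column from the rest)
    reg = LinearRegression().fit(X[:, :3], X[:, 3])

    # Otherwise -> dimensionality reduction
    X_2d = PCA(n_components=2).fit_transform(X)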


You may also need to do model selection using cross validation.

Cross validation

What you do is simply split your dataset into k non-overlapping subsets (folds), train a model using k-1 folds, and estimate its performance on the fold you left out. You do this for each possible choice of held-out fold (first leave the 1st fold out, then the 2nd, ..., then the kth, training on the remaining folds each time). After finishing, you estimate the mean performance across all folds (and maybe also the variance/standard deviation of the performance).

How to choose the parameter k depends on the time you have. Usual values for k are 3, 5, 10 or even N, where N is the size of your data (that's the same as leave-one-out cross validation). I prefer 5 or 10.
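
A rough sketch of this procedure with k = 5, using scikit-learn's KFold; the SVC classifier and the iris data are only placeholders for whatever model and dataset you are evaluating.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    kf = KFold(n_splits=5, shuffle=True, random_state=0)

    scores = []
    for train_idx, test_idx in kf.split(X):
        model = SVC().fit(X[train_idx], y[train_idx])     # train on k-1 folds
        pred = model.predict(X[test_idx])                 # predict the held-out fold
        scores.append(accuracy_score(y[test_idx], pred))

    print(np.mean(scores), np.std(scores))                # mean and spread over folds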

Model selection

Let's say you have 5 methods (ANN, SVM, KNN, etc) and 10 parameter combinations for each method (depending on the method). You simply have to run cross validation for each method and parameter combination (5 * 10 = 50) and select the best model, method and parameters. Then you re-train with the best method and parameters on all your data and you have your final model.
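
A sketch of that selection loop using scikit-learn's GridSearchCV; the two candidate methods and their parameter grids below are made-up examples, not recommendations.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Each entry is (method, parameter grid); cross-validate every combination.
    candidates = [
        (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
        (KNeighborsClassifier(), {"n_neighbors": [1, 5, 15]}),
    ]

    best_score, best_model = -1.0, None
    for estimator, grid in candidates:
        search = GridSearchCV(estimator, grid, cv=5).fit(X, y)
        if search.best_score_ > best_score:
            best_score, best_model = search.best_score_, search.best_estimator_

    # Re-train the winning method and parameters on all the data: the final model.
    final_model = best_model.fit(X, y)
    print(type(final_model).__name__, best_score)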

There are some more things to say. If, for example, you use a lot of methods and parameter combinations for each, it's very likely you will overfit. In cases like these, you have to use nested cross validation.

Nested cross validation

In nested cross validation, you perform cross validation on the model selection algorithm.

Again, you first split your data into k folds. In each step, you take k-1 folds as your training data and the remaining one as your test data. You then run model selection (the procedure explained above) on each such combination of the k folds. After finishing, you have k selected models, one for each combination of folds. You test each of these models on its held-out fold and choose the best one. Finally, having chosen the best method and parameters, you train a new model with them on all the data you have. That's your final model.
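
In scikit-learn terms, nested cross validation can be sketched by wrapping a GridSearchCV (the inner model-selection step) inside an outer cross_val_score; the grid values here are illustrative only.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Inner loop: model selection over the parameter grid.
    inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=5)

    # Outer loop: estimate how well the whole selection procedure generalizes.
    outer_scores = cross_val_score(inner, X, y, cv=5)
    print(outer_scores.mean(), outer_scores.std())

    # Final model: run the selection once more on all the data.
    final_model = inner.fit(X, y).best_estimator_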

Of course, there are many variations of these methods and other things I didn't mention. If you need more information about these look for some publications about these topics.


The book "OpenCV" has a great two pages on this on pages 462-463. Searching the Amazon preview for the word "discriminative" (probably google books also) will let you see the pages in question. These two pages are the greatest gem I have found in this book.

In short:

  • Boosting - often effective when a large amount of training data is available.

  • Random trees - often very effective and can also perform regression.

  • K-nearest neighbors - simplest thing you can do, often effective but slow and requires lots of memory.

  • Neural networks - Slow to train but very fast to run, still optimal performer for letter recognition.

  • SVM - Among the best with limited data, but losing against boosting or random trees only when large data sets are available.


Things you might consider in choosing which algorithm to use would include:

  1. Do you need to train incrementally (as opposed to batched)?

    If you need to update your classifier with new data frequently (or you have tons of data), you'll probably want to use Bayesian. Neural nets and SVM need to work on the training data in one go (see the incremental-training sketch after this list).

  2. Is your data composed of categorical only, or numeric only, or both?

    I think Bayesian works best with categorical/binomial data. Decision trees can't predict numerical values.

  3. Do you or your audience need to understand how the classifier works?

    Use Bayesian or decision trees, since these can be easily explained to most people. Neural networks and SVM are "black boxes" in the sense that you can't really see how they are classifying data.

  4. How much classification speed do you need?

    SVMs are fast when it comes to classifying, since they only need to determine which side of the "line" your data is on. Decision trees can be slow, especially when they're complex (e.g. lots of branches).

  5. Complexity.

    Neural nets and SVMs can handle complex non-linear classification.
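
As a sketch for point 1: some estimators support incremental training directly; for example, scikit-learn's GaussianNB exposes partial_fit, so new batches can be folded in without retraining from scratch. The "stream" of batches below is simulated with random data.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.RandomState(0)
    model = GaussianNB()
    classes = np.array([0, 1])           # must be declared on the first partial_fit call

    for _ in range(10):                  # pretend each iteration is a new batch arriving
        X_batch = rng.randn(100, 5)
        y_batch = (X_batch[:, 0] > 0).astype(int)
        model.partial_fit(X_batch, y_batch, classes=classes)   # update, don't retrain

    print(model.predict(rng.randn(3, 5)))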


As Prof Andrew Ng often states: always begin by implementing a rough, dirty algorithm, and then iteratively refine it.

For classification, Naive Bayes is a good starter, as it performs well, is highly scalable, and can adapt to almost any kind of classification task. Also, 1NN (K-Nearest Neighbours with only 1 neighbour) is a no-hassle best-fit algorithm (because the data is the model, so you don't have to worry about fitting a decision boundary of the right dimensionality); the only issue is the computation cost (quadratic, because you need to compute the distance matrix), so it may not be a good fit for high-dimensional data.

Another good starter algorithm is Random Forests (composed of decision trees); it is highly scalable to any number of dimensions and generally gives quite acceptable performance. Then, finally, there are genetic algorithms, which scale admirably well to any dimension and any data with minimal knowledge of the data itself; the most minimal and simplest implementation is the microbial genetic algorithm (only one line of C code! by Inman Harvey in 1996), and among the most complex are CMA-ES and MOGA/e-MOEA.
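
A small sketch of this "cheap baselines first" advice, comparing Naive Bayes, 1NN and a random forest with cross validation; the dataset and hyperparameters are arbitrary example choices.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)

    baselines = {
        "naive_bayes": GaussianNB(),
        "1nn": KNeighborsClassifier(n_neighbors=1),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }

    for name, model in baselines.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f}")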

And remember that, often, you can't really know what will work best on your data before you try the algorithms for real.

As a side-note, if you want a theoretical framework to test your hypotheses and the theoretical performance of algorithms on a given problem, you can use the PAC (Probably Approximately Correct) learning framework (beware: it's very abstract and complex!). To summarize, the gist of PAC learning says that you should use the least complex, but still complex enough, algorithm that can fit your data (complexity here being the maximum dimensionality the algorithm can fit). In other words, apply Occam's razor.


Sam Roweis used to say that you should try naive Bayes, logistic regression, k-nearest neighbour and Fisher's linear discriminant before anything else.


My take on it is that you always run the basic classifiers first to get some sense of your data. More often than not (in my experience at least) they've been good enough.

So, if you have supervised data, train a Naive Bayes classifier. If you have unsupervised data, you can try k-means clustering.

Another resource is one of the lecture videos of the series of videos Stanford Machine Learning, which I watched a while back. In video 4 or 5, I think, the lecturer discusses some generally accepted conventions when training classifiers, advantages/tradeoffs, etc.


You should always take into account the inference vs. prediction trade-off.

If you want to understand the complex relationship that is occurring in your data then you should go with a rich inference algorithm (e.g. linear regression or lasso). On the other hand, if you are only interested in the result you can go with high dimensional and more complex (but less interpretable) algorithms, like neural networks.


The choice of algorithm depends on the scenario and on the type and size of the data set, among many other factors.

This is a brief cheat sheet for basic machine learning; you can check your scenario against it.


First of all, it depends on which type of problem you're dealing with, classification or regression. Then choose your model wisely; on a particular dataset, one specific model may outperform the others. Suppose you are working on the wine dataset from the sklearn library: first you train the data with a linear-kernel SVM and get some level of accuracy, and if you find it unsatisfying you then try DecisionTreeClassifier() and then RandomForestClassifier(). Whichever gives the better accuracy, or fits your data best, is the one you can go with. Switching between models for testing requires only a small syntactical difference, so try them all and understand the problem well.
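
A minimal sketch of that workflow on sklearn's wine dataset; the train/test split and the default hyperparameters are illustrative choices, and switching between the models really is only a small syntactic change.

    from sklearn.datasets import load_wine
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_wine(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Try each candidate model and compare held-out accuracy.
    for model in (SVC(kernel="linear"), DecisionTreeClassifier(), RandomForestClassifier()):
        acc = model.fit(X_train, y_train).score(X_test, y_test)
        print(type(model).__name__, round(acc, 3))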

Reference URL: https://stackoverflow.com/questions/2595176/which-machine-learning-classifier-to-choose-in-general
