*This article was published as a part of the Data Science Blogathon.*

## Introduction

If you have experience in Machine Learning, specifically supervised learning, you know that hyperparameter tuning is an important process for improving model accuracy. This process searches for good values of the hyperparameters of a Machine Learning algorithm.

An algorithm learns its parameters from the observed data, whereas hyperparameters are set before training and are not learned from the data. Each algorithm has its own set of hyperparameters. If we do not tune them, the default values are used.
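For example, with scikit-learn we can inspect the default hyperparameter values that are used when nothing is tuned. A minimal sketch:

```python
from sklearn.ensemble import GradientBoostingClassifier

# With no tuning, the model falls back to its default hyperparameters
model = GradientBoostingClassifier()
defaults = model.get_params()
print(defaults['learning_rate'])  # 0.1
print(defaults['n_estimators'])   # 100
print(defaults['max_depth'])      # 3
```

These defaults are reasonable starting points, but they are rarely the best values for a particular dataset.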

There are many ways to do hyperparameter tuning. This article focuses on Bayesian Optimization, as it is my favorite. There are two packages that I usually use for Bayesian Optimization: “bayes_opt” and “hyperopt” (Distributed Asynchronous Hyper-parameter Optimization). We will compare the two in terms of running time, accuracy, and output.

But before that, let us cover some basics of hyperparameter tuning.

## Hyperparameter-tuning

Hyperparameter-tuning is the process of searching for the hyperparameter values that give the most accurate model for a dataset and a Machine Learning algorithm. To do this, we repeatedly fit and evaluate the model with different hyperparameter values until we find the combination with the best accuracy.

The search methods can be divided into uninformed search and informed search. An uninformed search tries sets of hyperparameters repeatedly and independently; each trial does not inform or suggest the others. Examples of uninformed search are GridSearchCV and RandomizedSearchCV.

## Hyperparameter Tuning Using GridSearchCV

Now, I want to perform hyperparameter tuning on GradientBoostingClassifier. The dataset is from a Kaggle competition. The hyperparameters to tune are “max_depth”, “max_features”, “learning_rate”, “n_estimators”, and “subsample”.

Note that, as mentioned above, these hyperparameters apply to GradientBoostingClassifier, not to other algorithms. The accuracy metric is the accuracy score, and I will run 5-fold cross-validation.

Below is the code for GridSearchCV. We can see that the value options for each hyperparameter are set in the “param_grid”.

For example, GridSearchCV will try n_estimators of 80, 100, and so on up to 150. To know how many combinations GridSearchCV will try, multiply the numbers of value options of the hyperparameters with one another: 8 x 3 x 3 x 5 x 3 = 1080. And for each of the 1080 combinations, there is 5-fold cross-validation. That makes 1080 x 5 = 5400 models to build before we know which is the best.
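The combination count above can be checked with a couple of lines:

```python
# Size of the grid: multiply the number of options per hyperparameter
param_grid = {'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
              'max_features': [0.8, 0.9, 1],
              'learning_rate': [0.01, 0.1, 1],
              'n_estimators': [80, 100, 120, 140, 150],
              'subsample': [0.8, 0.9, 1]}

n_combinations = 1
for options in param_grid.values():
    n_combinations *= len(options)

print(n_combinations)      # 1080 hyperparameter combinations
print(n_combinations * 5)  # 5400 models with 5-fold cross-validation
```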

```python
# Load packages
from scipy.stats import uniform
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# GridSearchCV
param_grid = {'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
              'max_features': [0.8, 0.9, 1],
              'learning_rate': [0.01, 0.1, 1],
              'n_estimators': [80, 100, 120, 140, 150],
              'subsample': [0.8, 0.9, 1]}
grid = GridSearchCV(estimator=GradientBoostingClassifier(),
                    param_grid=param_grid, scoring=acc_score, cv=5)
grid.fit(X_train.iloc[1:100,], y_train.iloc[1:100,])
```

### Disadvantages

The disadvantage of this method is that we can miss good hyperparameter values that were not set at the beginning. For instance, we did not set an option for max_features of 0.85 or a learning_rate of 0.05, so we do not know whether those combinations would give better accuracy. To overcome this, we can try RandomizedSearchCV.

## Hyperparameter Tuning Using RandomizedSearchCV

Below is the code for that. Notice that it sets a range of possible values for each hyperparameter rather than a fixed list. For example, the learning_rate can take any value from 0.01 to 1, distributed uniformly.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# RandomizedSearchCV
# Note: scipy's uniform(loc, scale) samples from [loc, loc + scale], and
# integer-valued hyperparameters need randint rather than uniform.
param_rand = {'max_depth': randint(3, 11),
              'max_features': uniform(0.8, 0.2),
              'learning_rate': uniform(0.01, 0.99),
              'n_estimators': randint(80, 151),
              'subsample': uniform(0.8, 0.2)}
rand = RandomizedSearchCV(estimator=GradientBoostingClassifier(),
                          param_distributions=param_rand,
                          scoring=acc_score, cv=5)
rand.fit(X_train.iloc[1:100,], y_train.iloc[1:100,])
```

### Problem With Uninformed Search

The problem with uninformed search is that building all the models takes a relatively long time. Informed search can solve this problem. In informed search, the models already built with certain sets of hyperparameter values inform the later models which hyperparameter values are better to select.

One of the methods to do this is coarse-to-fine. This involves running GridSearchCV or RandomizedSearchCV more than once. Each time, the hyperparameter value range is more specific.

For example, we start RandomizedSearchCV with learning_rate ranging from 0.01 to 1. Then, we find out that high accuracy models have their learning_rate around 0.1 to 0.3. Hence, we can run again GridSearchCV focusing on the learning_rate between 0.1 and 0.3. This process can continue until a satisfactory result is achieved. The first trial is coarse because the value range is large, from 0.01 to 1. The later trial is fine as the value range is focused on 0.1 to 0.3.
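The two passes of coarse-to-fine can be sketched as follows; the value ranges, `n_iter`, and the `'accuracy'` scorer here are illustrative choices, not from the original code:

```python
from scipy.stats import uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Pass 1 (coarse): sample learning_rate anywhere in [0.01, 1]
# (scipy's uniform(loc, scale) spans [loc, loc + scale])
coarse = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions={'learning_rate': uniform(0.01, 0.99)},
    n_iter=20, scoring='accuracy', cv=5)
# coarse.fit(X_train, y_train)

# Pass 2 (fine): zoom in on the promising region found in pass 1
fine = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={'learning_rate': [0.10, 0.15, 0.20, 0.25, 0.30]},
    scoring='accuracy', cv=5)
# fine.fit(X_train, y_train)
```

Each pass reuses the same search machinery; only the search space shrinks.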

The drawback of the coarse-to-fine method is that we need to run the code repeatedly and narrow the hyperparameter value ranges ourselves each time. You might be wondering if there is a way to automate this. Yes, and that is why my favorite is Bayesian Optimization.

## Bayesian Optimization

Bayesian Optimization also runs models many times with different sets of hyperparameter values, but it evaluates the information from past models to select the hyperparameter values for the next model. It is therefore said to reach the highest-accuracy model in less time than the previously discussed methods.

### bayes_opt

As mentioned in the beginning, there are two packages in Python that I usually use for Bayesian Optimization. The first one is bayes_opt. Here is the code to run it.

```python
import time
from bayes_opt import BayesianOptimization

# Gradient Boosting Machine: the objective returns mean 5-fold CV accuracy
def gbm_cl_bo(max_depth, max_features, learning_rate, n_estimators, subsample):
    params_gbm = {}
    params_gbm['max_depth'] = round(max_depth)
    params_gbm['max_features'] = max_features
    params_gbm['learning_rate'] = learning_rate
    params_gbm['n_estimators'] = round(n_estimators)
    params_gbm['subsample'] = subsample
    score = cross_val_score(GradientBoostingClassifier(random_state=123, **params_gbm),
                            X_train, y_train, scoring=acc_score, cv=5).mean()
    return score

# Run Bayesian Optimization
start = time.time()
params_gbm = {'max_depth': (3, 10),
              'max_features': (0.8, 1),
              'learning_rate': (0.01, 1),
              'n_estimators': (80, 150),
              'subsample': (0.8, 1)}
gbm_bo = BayesianOptimization(gbm_cl_bo, params_gbm, random_state=111)
gbm_bo.maximize(init_points=20, n_iter=4)
print('It takes %s minutes' % ((time.time() - start)/60))
```

Output:

```
|   iter    |  target   | learni... | max_depth | max_fe... | n_esti... | subsample |
-------------------------------------------------------------------------------------
|  1        |  0.7647   |  0.616    |  4.183    |  0.8872   |  133.8    |  0.8591   |
|  2        |  0.7711   |  0.1577   |  3.157    |  0.884    |  96.71    |  0.8675   |
|  3        |  0.7502   |  0.9908   |  4.664    |  0.8162   |  126.9    |  0.9242   |
|  4        |  0.7681   |  0.2815   |  6.264    |  0.8237   |  85.18    |  0.9802   |
|  5        |  0.7107   |  0.796    |  8.884    |  0.963    |  149.4    |  0.9155   |
|  6        |  0.7442   |  0.8156   |  5.949    |  0.8055   |  111.8    |  0.8211   |
|  7        |  0.7286   |  0.819    |  7.884    |  0.9131   |  99.2     |  0.9997   |
|  8        |  0.7687   |  0.1467   |  7.308    |  0.897    |  108.4    |  0.9456   |
|  9        |  0.7628   |  0.3296   |  5.804    |  0.8638   |  146.3    |  0.9837   |
|  10       |  0.7668   |  0.8157   |  3.239    |  0.9887   |  146.5    |  0.9613   |
|  11       |  0.7199   |  0.4865   |  9.767    |  0.8834   |  102.3    |  0.8033   |
|  12       |  0.7708   |  0.0478   |  3.372    |  0.8256   |  82.34    |  0.8453   |
|  13       |  0.7679   |  0.5485   |  4.25     |  0.8359   |  90.47    |  0.9366   |
|  14       |  0.7409   |  0.4743   |  8.378    |  0.9338   |  110.9    |  0.919    |
|  15       |  0.7216   |  0.467    |  9.743    |  0.8296   |  143.5    |  0.8996   |
|  16       |  0.7306   |  0.5966   |  7.793    |  0.8355   |  140.5    |  0.8964   |
|  17       |  0.772    |  0.07865  |  5.553    |  0.8723   |  113.0    |  0.8359   |
|  18       |  0.7589   |  0.1835   |  9.644    |  0.9311   |  89.45    |  0.9856   |
|  19       |  0.7662   |  0.8434   |  3.369    |  0.8407   |  141.1    |  0.9348   |
|  20       |  0.7566   |  0.3043   |  8.141    |  0.9237   |  94.73    |  0.9604   |
|  21       |  0.7683   |  0.02841  |  9.546    |  0.9055   |  140.5    |  0.8805   |
|  22       |  0.7717   |  0.05919  |  4.285    |  0.8093   |  92.7     |  0.9528   |
|  23       |  0.7676   |  0.1946   |  7.351    |  0.9804   |  108.3    |  0.929    |
|  24       |  0.7602   |  0.7131   |  5.307    |  0.8428   |  91.74    |  0.9193   |
=====================================================================================
It takes 20.90080655813217 minutes
```

```python
params_gbm = gbm_bo.max['params']
params_gbm['max_depth'] = round(params_gbm['max_depth'])
params_gbm['n_estimators'] = round(params_gbm['n_estimators'])
params_gbm
```

Output:

```python
{'learning_rate': 0.07864837617488214,
 'max_depth': 6,
 'max_features': 0.8723008386644597,
 'n_estimators': 113,
 'subsample': 0.8358969695415375}
```

The package bayes_opt takes 20 minutes to build 24 models. The best accuracy is 0.772.
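Once tuning finishes, the winning dictionary can be passed straight to the classifier to refit a final model on the full training set. A minimal sketch, hard-coding the best values reported above:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Best hyperparameters found by the bayes_opt run above
best_params = {'learning_rate': 0.07864837617488214,
               'max_depth': 6,
               'max_features': 0.8723008386644597,
               'n_estimators': 113,
               'subsample': 0.8358969695415375}

final_model = GradientBoostingClassifier(random_state=123, **best_params)
# final_model.fit(X_train, y_train)
```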

### hyperopt

Another package is hyperopt. Here is the code.

```python
import numpy as np
from hyperopt import hp, fmin, tpe

# Run Bayesian Optimization from hyperopt
start = time.time()
space_lr = {'max_depth': hp.randint('max_depth', 3, 10),
            'max_features': hp.uniform('max_features', 0.8, 1),
            'learning_rate': hp.uniform('learning_rate', 0.01, 1),
            'n_estimators': hp.randint('n_estimators', 80, 150),
            'subsample': hp.uniform('subsample', 0.8, 1)}

def gbm_cl_bo2(params):
    params = {'max_depth': params['max_depth'],
              'max_features': params['max_features'],
              'learning_rate': params['learning_rate'],
              'n_estimators': params['n_estimators'],
              'subsample': params['subsample']}
    gbm_bo2 = GradientBoostingClassifier(random_state=111, **params)
    best_score = cross_val_score(gbm_bo2, X_train, y_train,
                                 scoring=acc_score, cv=5).mean()
    # hyperopt minimizes the objective, so return 1 - accuracy
    return 1 - best_score

gbm_best_param = fmin(fn=gbm_cl_bo2, space=space_lr, max_evals=24,
                      rstate=np.random.RandomState(42), algo=tpe.suggest)
print('It takes %s minutes' % ((time.time() - start)/60))
```

Output:

```
100%|██████████| 24/24 [19:53<00:00, 49.74s/trial, best loss: 0.22769091027055077]
It takes 19.897333371639252 minutes
```

```python
gbm_best_param
```

Output:

```python
{'learning_rate': 0.03516615427790515,
 'max_depth': 6,
 'max_features': 0.8920776081423815,
 'n_estimators': 148,
 'subsample': 0.9981549036976672}
```

The package hyperopt takes 19.9 minutes to run 24 models. The best loss is 0.228. It means that the best accuracy is 1 – 0.228 = 0.772. The duration to run bayes_opt and hyperopt is almost the same.

The accuracies are also almost the same, although the best hyperparameters found are different. There is another difference, though: bayes_opt prints the tuning process, so we can see which values are used in each iteration, while hyperopt shows only a single line with the progress bar, the best loss, and the duration.

In my opinion, I prefer bayes_opt because, in practice, the tuning process may take too long and we may want to terminate it early. After stopping the process, we still want to take the best result found so far. We can do that with bayes_opt, but not with hyperopt.

There are still other ways of automatic hyperparameter-tuning. Not only the hyperparameter-tuning, but choosing the Machine Learning algorithms also can be automated. I will discuss that next time. The above code is available here.


*The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.*

