

[1908.10714] Automated Architecture Design for Deep Neural Networks
I do not consider these findings very relevant; I attribute them to random noise in the experiments. Multiple runs of the search algorithms would give more statistically significant results and might produce a different ordering of the resulting networks by complexity, since the differences between the network architectures do not seem very significant in the experiments that I ran.
Abstract: Machine learning has made tremendous progress in recent years and received
large amounts of public attention. Though we are still far from designing a
fully artificially intelligent agent, machine learning has brought us many
applications in which computers solve human learning tasks remarkably well.
Much of this progress comes from a recent trend within machine learning, called
deep learning. Deep learning models are responsible for many state-of-the-art
applications of machine learning. Despite their success, deep learning models
are hard to train, very difficult to understand, and often so complex
that training is only possible on very large GPU clusters. Much work has
been done on enabling neural networks to learn efficiently. However, the design
and architecture of such neural networks is often done manually, through trial
and error and expert knowledge. This thesis inspects different approaches,
existing and novel, to automate the design of deep feedforward neural networks,
in an attempt to create less complex models with good performance that take
away the burden of deciding on an architecture and make it more efficient to
design and train such deep networks.
Figures:
Figure 3: Binary classification problem. The yellow area is one class; everything else is the other class. On the right is the shallow neural network that should represent the classification function. Figure taken from Bhiksha Raj's lecture slides in CMU's '11-785 Introduction to Deep Learning'. (Neural Networks as Universal Function Approximators)
Figure 4: Decision boundary for a square.
Figure 5: Decision boundary for a hexagon.
Figure 6: Decision plot for a square.
Figure 7: Decision plot for a hexagon.
Figure 8: Decision plots and boundaries for simple binary classification problems. Figures taken from Bhiksha Raj's lecture slides in CMU's '11-785 Introduction to Deep Learning'. (Neural Networks as Universal Function Approximators)
Figure 9: Decision plot and corresponding MLP structure for approximating a circle. Figure taken from Bhiksha Raj's lecture slides in CMU's '11-785 Introduction to Deep Learning'. (Neural Networks as Universal Function Approximators)
Figure 10: Decision boundary and corresponding two-layer classification network. Figure taken from Bhiksha Raj's lecture slides in CMU's '11-785 Introduction to Deep Learning'. (Relevance of Depth in Neural Networks)
Figure 11: Possible network topology changes, taken from Waugh [1994]. (Dynamic Learning)
Figure 12: The cascade-correlation neural network architecture after adding two hidden units. Squared connections are frozen after training them once; crossed connections are retrained in each training iteration. Figure taken and adapted from Fahlman and Lebiere [1990]. (Constructive Dynamic Learning)
Figure 13: Performance of the neural network found using manual search. Two hidden layers of 512 units each, using the tanh activation function in the hidden units and softmax in the output layer. Trained using RMSProp. Values averaged over 20 training runs. (Manual Search)
Figure 14: Simplified pseudocode for the implementation of evolving artificial neural networks. (Evolutionary Search)
Figure 15: Animation of how the population in the evolutionary search algorithm changes between iterations (best viewed in Adobe Acrobat). (Evolutionary Search)
Figure 16: Exploration of the network architecture search space using different search algorithms. Hidden activation function and optimizer are omitted. The color encoding is the same for all three plots. (Conclusion)
Figure 17: Exploration of the neural architecture search space for evolutionary search (with or without duplicates in the population), when removing all those architectures that were present in the initial population. The lower the activity in the search space, the more the exploration depends on the initial population. Hidden activation function and optimizer are omitted. The color encoding is the same for all three plots. (Conclusion)
Figure 18: Cascade-correlation learning algorithm, as proposed by Fahlman and Lebiere [1990]. The algorithm was run ten times, with a candidate pool of size eight, training each hidden unit in the candidate pool for two epochs and then choosing the one with the highest validation accuracy. This unit is then added into the network and trained until convergence (i.e., until the validation accuracy doesn't improve for three epochs in a row). Results are averaged over the ten runs, with the shaded area representing the 95% confidence interval. (Cascade-Correlation Networks)
Figure 19: Caser algorithm, as originally proposed by Littmann and Ritter [1992]. Results are averaged over the ten runs, with the shaded area representing the 95% confidence interval. (Cascade-Correlation Networks)
Figure 20: Unpredictable behavior when adding new units into the Caser network. The left plot shows the Caser network using a candidate pool size of eight, whereas on the right, a candidate pool of size 16 was used. Green dotted lines show the insertion of a new hidden unit into the network. (Cascade-Correlation Networks)
Figure 21: Reusing the output weight for all units in the candidate pool for Caser. Results are averaged over the ten runs, with the shaded area representing the 95% confidence interval. Lighter colored lines show the single runs. (Cascade-Correlation Networks)
Figure 22: Caser's dependence on the initial weight vector. On the left, the network finds a good initial local minimum, whereas on the right, the network finds a worse local minimum and does not improve its performance significantly. (Cascade-Correlation Networks)
Figure 23: Caser, reusing the previous output weight vector if all units in the candidate pool decrease the network's accuracy by more than 5%. (Cascade-Correlation Networks)
Figure 24: Using a candidate pool of seven new units and one unit reusing the previous output weights. Results averaged over ten runs, with the shaded area representing a 95% confidence interval. Lighter colored lines show the single runs. (Cascade-Correlation Networks)
Figure 25: Using a candidate pool of three new units and one unit reusing the previous output weights. Adding a total of 100 cascading hidden units. Results averaged over two runs, with the shaded area representing a 95% confidence interval. Lighter colored lines show the single runs. (Cascade-Correlation Networks)
Figure 26: Using a candidate pool of three new units and one unit reusing the previous output weights. Adding a total of 50 cascading hidden layers of 50 units each. Results averaged over five runs, with the shaded area representing a 95% confidence interval. Lighter colored lines show the single runs. (Cascade-Correlation Networks)
Figure 27: Using a candidate pool of three new units and one unit reusing the previous output weights. Adding a total of 15 cascading hidden layers of 100 units each. Results averaged over five runs, with the shaded area representing a 95% confidence interval. Lighter colored lines show the single runs. (Cascade-Correlation Networks)
Figure 28: Training and validation accuracy per epoch in forward thinking. Results are averaged over 20 runs; the shaded areas show the 95% confidence interval. (Forward Thinking)
Figure 29: Training and validation loss per epoch in forward thinking. Results are averaged over 20 runs; the shaded areas show the 95% confidence interval. (Forward Thinking)
Figure 30: The automated forward thinking algorithm, trained for ten layers. The resulting network has the layers [950, 700, 700, 500, 50, 200, 500, 850, 550, 350]. (Automated Forward Thinking)
Figure 31: Training stopped too early.
Figure 32: Training stopped too late.
Figure 33: Automated forward thinking with early stopping when the validation accuracy does not increase after adding a layer. The network on the left has two layers [950, 1000], whereas the network on the right has six layers [950, 500, 150, 300, 50, 300]. (Automated Forward Thinking)
Figure 34: The automated forward thinking algorithm run 20 times. The shaded area shows the 95% confidence interval. (Automated Forward Thinking)
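The Figure 18 caption describes the two decisions at the heart of cascade-correlation's growth loop: briefly train every unit in a candidate pool and keep the one with the highest validation accuracy, then train the grown network until the validation accuracy stalls for three epochs in a row. A minimal Python sketch of those two steps, assuming illustrative helper names (`select_candidate`, `train_until_convergence`) rather than the thesis's actual implementation:

```python
def select_candidate(candidates, score):
    """Pick the candidate hidden unit with the highest validation score.

    candidates : pool of briefly pre-trained candidate units (size eight
                 in the experiments behind Figure 18)
    score      : maps a candidate to its validation accuracy
    """
    return max(candidates, key=score)


def train_until_convergence(val_accuracies, patience=3):
    """Return the epoch at which training would stop.

    Training is considered converged once the validation accuracy fails
    to improve for `patience` epochs in a row (three, per Figure 18).
    `val_accuracies` is the per-epoch validation accuracy history.
    """
    best, stalled = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, stalled = acc, 0   # new best: reset the patience counter
        else:
            stalled += 1
            if stalled >= patience:
                return epoch         # no improvement for `patience` epochs
    return len(val_accuracies) - 1   # history exhausted before convergence
```

In the full algorithm this pair repeats: each selected unit is frozen into the cascade, and a fresh candidate pool is trained against the residual error.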
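The early-stopping variant of automated forward thinking (Figure 33) can be read as a greedy loop: add a layer, re-evaluate the network, and stop growing as soon as validation accuracy fails to improve. A hedged sketch under that reading, where `grow_network` and the `evaluate` callback are assumed names for illustration, not code from the thesis:

```python
def grow_network(layer_widths, evaluate):
    """Greedily stack layers; stop when a new layer stops helping.

    layer_widths : candidate layer widths to try, in order
    evaluate     : maps a list of layer widths to a validation accuracy
                   (assumed helper that trains and scores the network)
    """
    layers, best = [], evaluate([])
    for width in layer_widths:
        acc = evaluate(layers + [width])
        if acc <= best:
            break                        # adding this layer did not help
        layers, best = layers + [width], acc
    return layers


# Toy usage: accuracy saturates after two layers, so growth stops there.
def toy_evaluate(layers):
    return min(0.5 + 0.1 * len(layers), 0.7)
```

With `toy_evaluate`, `grow_network([950, 500, 150, 300], toy_evaluate)` keeps only the first two layers, mirroring how the left network in Figure 33 stopped at two layers while the right one grew to six.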



Related (TF-IDF):
[1807.02816] Improving Deep Learning through Automatic Programming
[1505.02000] Deep Learning for Medical Image Segmentation
[1905.06010] Automatic Model Selection for Neural Networks
[1810.05526] Automatic Configuration of Deep Neural Networks with EGO
[1807.05292] Neural Networks Regularization Through Representation Learning
[1712.07420] Finding Competitive Network Architectures Within a Day Using UCT
[1901.06261] NeuNetS: An Automated Synthesis Engine for Neural Network Design
[1909.03306] A greedy constructive algorithm for the optimization of neural network architectures
[1706.05719] Towards the Improvement of Automated Scientific Document Categorization by Deep Learning
[1801.08577] Effective Building Block Design for Deep Convolutional Neural Networks using Search
