[1910.11424v1] Capacity, Bandwidth, and Compositionality in Emergent Language Learning

Abstract: Many recent works have discussed the propensity, or lack thereof, of emergent languages to exhibit properties of natural languages. A favorite property in the literature is compositionality. We note that most of these works have treated communicative bandwidth as the factor of primary importance. While important, it is not the only contributing factor. In this paper, we investigate the learning biases that affect the efficacy and compositionality of emergent languages. Our foremost contribution is to explore how the capacity of a neural network affects its ability to learn a compositional language. We additionally introduce a set of evaluation metrics with which we analyze the learned languages. Our hypothesis is that there is a specific range of model capacity and channel bandwidth that induces compositional structure in the resulting language and consequently encourages systematic generalization. While we empirically find evidence for the bottom of this range, we curiously do not find evidence for the top of the range and believe that this is an open question for the community.

$(\Sigma \cup N)^{*} \, N \, (\Sigma \cup N)^{*} \to (\Sigma \cup N)^{*}$

Figure 1: Main results for model A showing the best and worst performances of the proposed metrics over 10 seeds. See Section ?? for detailed analysis. Panels (a) and (f) show accuracy on the training data, (b) and (d) show entropy, (e) and (g) show recall over the test data, and (c) plots the maximum difference in accuracy between training and test.

(Results) Figure 2: Histograms showing precision and recall over the test set, and entropy results for model A. Each bit/parameter combination is trained for 10 seeds over 200k steps. Precision and recall are computed as described in Eqs. (??) and (??) with M = |D_test| and N = 10000.

(Results) Figure 3: Model A entropy vs. overfitting: charts showing per-bit results for entropy vs. (Train − Validation) over the parameter range. Observe the two models at 23 and 24 bits, which were too successful in producing a non-compositional code and consequently overfit to the data.

(Appendix) Figure 6: Entropy metric for models A and B as described in §??. As in our analysis of model A in the main section, model B's chart supports the view we had of its recall in Figure ??.

(Appendix) Figure 4: Efficacy results for models A and B.

(Appendix) Figure 5: Precision and recall for models A and B. As with model A, model B has perfect precision; however, its recall chart tells a different story. The first takeaway is that while there is still a strong region in the top right, bounded below by ∼360k parameters, it does not extend to 22 bits on the left side. This supports our notion of a minimal capacity threshold but adds a wrinkle: the architecture influences the model's ability to succeed with fewer bits.

(Appendix) Figure 7: Results when running model A with N = 4 categories instead of N = 6. Here we need at least ⌈log2 10^4⌉ = 14 bits to cover all the input combinations. Observe that the histograms differ little from the N = 6 scenario.
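The bit-count bound in the Figure 7 caption follows from simple counting: b bits yield 2^b distinct messages, so covering K input combinations requires b ≥ ⌈log2 K⌉. A minimal sketch of this arithmetic (the function name `min_bits` is our own, not from the paper; integer `bit_length` is used to avoid floating-point rounding in `log2`):

```python
def min_bits(num_inputs: int) -> int:
    """Smallest b such that 2**b distinct codes cover num_inputs inputs."""
    if num_inputs < 1:
        raise ValueError("need at least one input combination")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, and 0 for n == 1
    return (num_inputs - 1).bit_length()

# The 10^4 input combinations from the N = 4 setting need 14 bits
print(min_bits(10**4))  # -> 14
```

For an exact power of two the bound is tight, e.g. `min_bits(2**14)` is 14, so 16384 inputs still fit in 14 bits.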