An Empirical Study on The Nature of Method Names

Last modified: 2020-02-02

Program comprehension is crucial for engineers during software development. EASY-TO-UNDERSTAND code must have meaningful and succinct identifiers and names for the program entities so that engineers can quickly grasp the key functionality of the code. In facts, misleading names for program entities in a regular program or the APIs in a software library make the code harder to understand and maintain, even leading to defects and API misuses. In practice, companies have been emphasizing on naming conventions and coding standards.

Names of methods are particularly important, and can be difficult to choose. "Methods are the smallest named units of aggregated behavior in most conventional programming languages and hence the cornerstone of abstraction". Ideally, "if you have a good method name, you don’t need to look at the body". Hence, generating the good name for a method is beneficial not only for recommending the appropriate name for the method at the first time when the method was written, but also for detecting the written bad name and suggesting the alternative.

In this work, we conducted an empirical study to answer the following questions on the nature of method names:

  • Q1: What are the characteristics of method names regarding their uniqueness and sizes?
  • Q2: How is the relation between method names and the method body (refered as implementation context), the types of parameters and the return type (called interface context), and enclosing contexts which is the containing class in a program?
The answers to the questions provide the empirical foundation on whether the tokens/names in contexts could have predictive impacts on predicting method names. (I believe) That could be useful for one who wants to design the method to suggest a method name.

Data collection and processing

We used the massive scale dataset of 14,317 top-ranked, high-quality, long-history Java projects on GitHub, that was used in a prior work. In this dataset, all duplicated Java files and migrated projects and the forks of the same projects are filtered out. This dataset includes 2,127,355 files and 17,012,754 methods. It has the latest, stable versions of the projects, ensuring the method names at the stable and good standing. For each method in the dataset, we collected the method’s name, the parameters’ names and types, the return type, the class name, and the names of the variables, fields, and method calls within the body of the method. We tokenized each of those names using Camelcase and underscore naming conventions, and the tokens are normalized to lowercase.

Uniqueness and Sizes of Method Names

In the dataset, there are 3,402,550 unique method names and 120,303 unique tokens in method names. On average, there are 2.64 tokens per method name, and the median is 3 tokens. The longest one contains 83 tokens. Meanwhile, the number of tokens in a method body is 17.3 times greater than that of a method name. There are 95% method bodies whose numbers of tokens are 3.0 times greater than the method names. Most of method names (with few tokens) are multiple times shorter than the corresponding method bodies.

Method name Token
Mean #occcurrence 4.8 400.3
Median #occcurrence 1 3
#occcurrence = 1 62.9% 21.9%
#occcurrence > 1 37.1% 78.1%

As seen in the above table, for method names, nearly 2 out of 3 names (62.9%) are unique. This leads to that a high number of method names (≈ 2/3 of the names) cannot be identified by searching in the set of cases that are previously encountered. The mean value 4.8 of the occurrences of method names is due to a large number of occurrences of common names, e.g., toString, equals, and hashCode.

In contrast, for the tokens as parts of method names, a token is usually used repeatedly multiple names. 3 out of 4 tokens (78.1%) used to comprise a method name are likely to be previously seen as parts of other method names. Thus, we can conclude that a method name is often comprised of the tokens that have been previously seen.

These above results support us to use a generative summarization approach to learn the tokens composing method names from the tokens that are previously encountered in other method names.

The Relation between Method Names and Contexts

Common tokens shared between a method name and the contexts

For a method, we first computed the percentage of the tokens of the method name that also appear in its contexts. In the below left figure, on average, for a method, 65.0% of the tokens in its name are found in the names of the program entities in the contexts (the median is 66.7%). Note that the mean and median number of tokens in a method name are 2.7 and 3, respectively. Thus, on average, about 2 out of 3 tokens of a method name can be found in the three contexts. As seen in the figure, the highest percentage of the tokens in a method name is found in the body (62.0%), while the next ones are in interface context (14.9%) and enclosing context (6.1%).

Generic placeholder image
% of tokens in method names found in contexts
Generic placeholder image
The percentages of the methods whose names share certain tokens with the contexts

The reasons for such trend among three contexts are as follows. First, the implementation context usually contains the largest number of tokens, and the number of tokens in a class name is usually less than that of interface context (including the names, and the return type and parameters’ types). Second, since for a method, its name describes the functionality, while the implementation context describes how the functionality is realized, the implementation context has the closest relation with the method name. Meanwhile, the interface context describes how the method interacts with other entities, thus having a stronger relation with the method name compared to the enclosing context, which describes the general context for the method and others in the same class.

We also aim to explore the pervasiveness of the sharing tokens between method names and the contexts. Specifically, we calculated the percentages of the methods whose names share certain proportions of tokens with the corresponding contexts. As seen in the right figure above, we found that 84.6% of the methods are given the method names in which at least 33.3% of the tokens (1 out of 3 tokens) are found in the contexts. In 79.8% of the methods, at least half of the tokens in method names are in the contexts. Especially, 36.7% of the methods is comprised of the tokens, such that all of the tokens in method names are in the contexts. Thus, there are high percentages of the methods whose names share with those of the entities in the context.

The conditional occurrences of tokens in the method names on the contexts

We investigated the conditional occurrences of the tokens in method names on those of program entities in the contexts. For a method, we computed the conditional occurrence as the conditional probability that the token t is used in the method name given the tokens from the names in the contexts. For a context, the conditional probability of t given context C is computed as: P(t|C) ≈ Cooccur(t,C)/Occur(C)where Cooccur(t,C) is the number of methods whose names contain t, and their contexts are identical to C. Occur(C) is the number of methods whose contexts are identical to C. The higher P(t|C), the stronger power the context C provides to predict the token t.

We found that on average, the occurrence of a token in method name conditionally on the implementation context is 35.9%. That is, when encountering all the tokens in a method’s body, in 35.9% of the cases, we could see a token in the method’s name. Meanwhile, the input interface (parameters), output interface (return type), and enclosing contexts provide certain indication to predict the tokens in method names, in which the conditional occurrences of a token in method name on each of those contexts are 18.8%, 17.1%, and 8.3%, respectively. Especially, there is a considerable number of tokens that are always found in the method names when certain contexts are encountered (i.e., the case when P(t|C) = 1). Specifically, the percentages for the implementation, input interface, output interface, and enclosing contexts are 5.9%, 2.2%, 1.5%, and 0.63%, respectively. For example, the enclosing context of smtp mail sender always contains the methods whose names contain the token send. Among the cases of P(t|C) = 1, the percentage of the cases in which the token t does not appear in C where C is implementation context is 43.6%. The respective percentages of such cases for input, output, and enclosing contexts are 89.7%, 76.6% and 82.6%. Thus, even the tokens are not in the contexts, the contexts can be used to predict the tokens in method names due to high conditional occurrences.

Conditional occurrences of tokens in inconsistent and alternative good names on the contexts

We also study the capability of the contexts in making distinction between a consistent name and an inconsistent (bad) name of a method. Specifically, we used a dataset from Liu et al., which contains a set of methods whose names and their bodies are inconsistent (about 1.4K cases), as well as the corresponding good names (about 1.4K cases) that were used by real-world developers to replace the inconsistent names of those methods.

The below figure shows the average conditional probabilities on the occurrences of the inconsistent and alternative good names of the tokens in the method names given each of the contexts. As seen, for all contexts, the average P(t|C) of the tokens in inconsistent method names is relatively much lower than that in the alternative consistent names. Thus, each context can provide the indication of the occurrences of the tokens in good names more than those in inconsistent names.

Generic placeholder image
Avg conditional occurrence of tokens in method names on the contexts

Conclusion

Our results provide empirical evidence to confirm the principle of naturalness of software that also holds for the tokens composing the names of the methods and the names of program entities in the contexts. The tokens composing the names of the entities in the contexts appear together regularly and naturally due to the intention of developers in realizing the method’s functionality. That functionality is captured by an abstract, succinct method name. Thus, the appearances of the tokens in the entities in the contexts can have impact on those of the tokens in the method name. For example, in the method getHostAddress, which aims to retrieve the host address, the tokens of the program entities (inet, Addr, Local, Address, Listen, etc.) are relevant to achieve that task. Moreover, our results also confirm the benefits of using code contexts for name prediction as in the prior work for code-to-texts.

We conclude the following important results in this study:

  1. 62.9% of the full method names are unique. For a given method, one cannot rely solely on searching for a good name in the data of the previously seen method names
  2. 78.1% of the tokens in method names can be found in the other previously seen method names.
  3. There are high proportions of the tokens (average of 65.0%) of method names which are shared with the three contexts. There are high percentages of the methods (79.8%) whose tokens in names share (+50%) with the tokens of the entities’ names in the contexts.
  4. When encountering all the tokens of the name of the program entities used in the body of a method, in 35.9% of the cases, we could see a token in the method’s name.
  5. Even the tokens are not found in the contexts, one could use the contexts to predict the tokens in the method names due to those high conditional occurrences
  6. Each context provides the indication of occurrences of the tokens in the good names more than in inconsistent (bad) names.

References

  • Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Proceedings of the 34th International Conference on Software Engineering (ICSE ’12). IEEE Press, 837–847.
  • M. Allamanis and C. Sutton, "Mining source code repositories at massive scale using language modeling," 2013 10th Working Conference on Mining Software Repositories (MSR), San Francisco, CA, 2013, pp. 207-216.
  • Einar W. Høst and Bjarte M. Østvold. 2009. Debugging Method Names. In Proceedings of the 23rd European Conference on ECOOP 2009 --- Object-Oriented Programming (Genoa). Springer-Verlag, Berlin, Heidelberg, 294–317.
  • 1999. Refactoring: improving the design of existing code. Addison-Wesley Longman Publishing Co., Inc., USA.
  • Simon Butler, Michel Wermelinger, Yijun Yu, and Helen Sharp. 2009. Relating Identifier Naming Flaws and Code Quality: An Empirical Study. In Proceedings of the 2009 16th Working Conference on Reverse Engineering (WCRE ’09). IEEE Computer Society, USA, 31–35
  • Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2014. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). Association for Computing Machinery, New York, NY, USA, 281–293.
  • Kui Liu, Dongsun Kim, Tegawendé F. Bissyandé, Taeyoung Kim, Kisub Kim, Anil Koyuncu, Suntae Kim, and Yves Le Traon. 2019. Learning to spot and refactor inconsistent method names. In Proceedings of the 41st International Conference on Software Engineering (ICSE ’19). IEEE Press, 1–12.