IdentifierSimilarity is an extension of Namesake: A Checker of Lexical Similarity in Identifier Names. In this project, Namesake is used to "score" identifiers for similarity against each other in order to further assess lexical similarity in code and the impact of this similarity on the level of ease developers have debugging code in a survey setting.
Namesake is an open-source tool for assessing confusing naming combinations in Python programs. Namesake flags confusing identifier naming combinations that are similar in:
- orthography (word form)
- phonology (pronunciation)
- or semantics (meaning)
Lexical access describes the retrieval of word shape (orthography), pronunciation (phonology), and meaning (semantics) from memory during reading for comprehension.
Orthographic similarity focuses on the the similarity in word form on the level of letters. Not to be confused by editing distance or Levenshtein's distance, where one letter is replaced by another, orthographic similarity focuses on the similarities between letters shapes. A good example is the confusion between 'O' and 'C' as individual letters or within words and sentences. Here's a common example in code:
Survery participants were presented either code snippet A (left), which contains the orthographically dissimilar identifiers "l" and "u" or code snippet B (right), which contains the orthographically similar identifiers "x" and "y".
Phonological similarity describes two words that share a similar or identical pronunciation, also known as homophones:
Survery participants were presented either code snippet A (left), which contains the phonologically dissimilar identifiers "pt" and "err" or code snippet B (right), which contains the phonologically similar identifiers "pare" and "pair".
Semantic similarity describes words that share a meaning (synonyms):
Survery participants were presented either code snippet A (left), which contains the semantically dissimilar identifiers "element" and "var" or code snippet B (right), which contains the semantically similar identifiers "short_string" and "little_string".
- Note that there are 4 different potential code snippet versions of each similarity genre available to be presented to any given participant in the survey.
Identifier pairs are awarded a score 0-100 based on how similar they are in an orthographic (look), phonological (sound), or semantic (meaning), basis.
An example of scoring for phonological similarity, using the International Phonetic Alphabet (IPA) to find the most similar IPA of out-of-vocabulary and IPA of in-vocabulary words.
This below output is an example of a Namesake warning that two identifiers within a code snippet are similar.
In a survey context, we screened the skill level of local programmers and presented them with a 12-question quiz based on code snippet prompts.
Using buggy code snippets containing commonly-confused identifiers, we assessed the debugging ability of programmers as follows:
first, to install the requirements:
pip install -r requirements.txt
To run Namesake on the file test1.py (with optional similarity thresholds):
python namesake.py test1.py [orth_threshold] [phon_threshold] [sem_threshold]
Threshold values must be between 0 and 1.
MIT (Free Software, Hell Yeah!)