
A Python tool that uses Namesake to further explore the impact of natural language processing mechanisms in screening for lexical similarity in code.


tamsinrogers/IdentifierSimilarity


IdentifierSimilarity

IdentifierSimilarity is an extension of Namesake: A Checker of Lexical Similarity in Identifier Names. In this project, Namesake is used to "score" pairs of identifiers for similarity, in order to further assess lexical similarity in code and, through a survey, how this similarity affects the ease with which developers debug code.

Namesake is an open-source tool for assessing confusing naming combinations in Python programs. Namesake flags confusing identifier naming combinations that are similar in:

  • orthography (word form)
  • phonology (pronunciation)
  • or semantics (meaning)

💡 What is Lexical Similarity in Code?

Lexical access describes the retrieval of word shape (orthography), pronunciation (phonology), and meaning (semantics) from memory during reading for comprehension.

Orthographic similarity is the similarity of word forms at the level of individual letters. It should not be confused with edit distance (Levenshtein distance), which counts the letter insertions, deletions, and substitutions needed to turn one word into another; orthographic similarity instead concerns how alike the letter shapes look. A good example is the confusion between 'O' and 'C', whether as individual letters or within words and sentences. Here's a common example in code:

Screenshot 2024-08-13 at 10 11 37 AM

Survey participants were presented with either code snippet A (left), which contains the orthographically dissimilar identifiers "l" and "u", or code snippet B (right), which contains the orthographically similar identifiers "x" and "y".

Screenshot 2024-08-13 at 10 17 01 AM Screenshot 2024-08-13 at 10 18 43 AM
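To make the contrast with edit distance concrete, here is a minimal sketch. It is not Namesake's actual implementation: the `SHAPE_SIMILARITY` table, its weights, and the function names are hypothetical and chosen only for illustration. It shows two identifier pairs that differ by exactly one substitution, where only one pair swaps in a letter with a similar shape.

```python
# Hypothetical visual-similarity weights between letter shapes
# (0.0 = unrelated, 1.0 = identical). These values are illustrative
# assumptions and are not taken from Namesake.
SHAPE_SIMILARITY = {
    frozenset("OC"): 0.8,
    frozenset("O0"): 0.9,
    frozenset("l1"): 0.9,
    frozenset("lI"): 0.9,
}


def letter_shape_similarity(a: str, b: str) -> float:
    """Return how alike two single characters look, from 0.0 to 1.0."""
    if a == b:
        return 1.0
    return SHAPE_SIMILARITY.get(frozenset((a, b)), 0.0)


def orthographic_score(first: str, second: str) -> float:
    """Score two equal-length identifiers from 0 to 100.

    The score is the average letter-shape similarity across positions.
    Identifiers of different lengths score 0 in this simplified sketch.
    """
    if len(first) != len(second) or not first:
        return 0.0
    total = sum(letter_shape_similarity(a, b) for a, b in zip(first, second))
    return 100.0 * total / len(first)


def substitutions(first: str, second: str) -> int:
    """Count differing positions (edit distance restricted to substitutions)."""
    return sum(a != b for a, b in zip(first, second))


# Both pairs differ by one substitution, but only the first pair
# replaces a letter with a similarly shaped one.
print(substitutions("COUNT", "OOUNT"), orthographic_score("COUNT", "OOUNT"))
print(substitutions("COUNT", "XOUNT"), orthographic_score("COUNT", "XOUNT"))
```

The substitution count is the same for both pairs, while the shape-based score is higher for the pair that swaps 'C' for 'O'.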

Phonological similarity describes two words that share a similar or identical pronunciation; words with identical pronunciation are known as homophones:

Screenshot 2024-08-13 at 10 12 00 AM

Survey participants were presented with either code snippet A (left), which contains the phonologically dissimilar identifiers "pt" and "err", or code snippet B (right), which contains the phonologically similar identifiers "pare" and "pair".

Screenshot 2024-08-13 at 10 19 22 AM Screenshot 2024-08-13 at 10 19 51 AM

Semantic similarity describes words that share a meaning (synonyms):

Screenshot 2024-08-13 at 10 12 16 AM

Survey participants were presented with either code snippet A (left), which contains the semantically dissimilar identifiers "element" and "var", or code snippet B (right), which contains the semantically similar identifiers "short_string" and "little_string".

Screenshot 2024-08-13 at 10 22 47 AM Screenshot 2024-08-13 at 10 23 05 AM
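As a rough illustration of semantic similarity between identifiers, the sketch below splits snake_case names into words and compares them against a small hand-written synonym table. Everything here is an assumption made for the example: `SYNONYM_GROUPS`, `same_meaning`, and `semantic_score` are hypothetical names, and Namesake's own approach may differ substantially.

```python
# Hypothetical synonym groups, written by hand for this example.
# A real tool would consult a lexical database instead.
SYNONYM_GROUPS = [
    {"short", "little", "small"},
    {"big", "large"},
]


def same_meaning(a: str, b: str) -> bool:
    """Return True if two words are equal or share a synonym group."""
    if a == b:
        return True
    return any(a in group and b in group for group in SYNONYM_GROUPS)


def semantic_score(first: str, second: str) -> float:
    """Score two snake_case identifiers from 0 to 100.

    The score is the share of word positions whose words mean the same.
    Identifiers with different word counts score 0 in this sketch.
    """
    first_words = first.lower().split("_")
    second_words = second.lower().split("_")
    if len(first_words) != len(second_words):
        return 0.0
    matches = sum(same_meaning(a, b) for a, b in zip(first_words, second_words))
    return 100.0 * matches / len(first_words)


print(semantic_score("short_string", "little_string"))
print(semantic_score("element", "var"))
```

With this toy table, the pair from snippet B scores higher than the pair from snippet A, which matches the intent of the survey examples.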

  • Note that there are 4 different code snippet versions for each similarity type, any of which may be presented to a given participant in the survey.

❓ How are Identifiers Scored for Similarity?

Identifier pairs are awarded a score from 0 to 100 based on how similar they are on an orthographic (look), phonological (sound), or semantic (meaning) basis.

The following is an example of scoring for phonological similarity, which uses the International Phonetic Alphabet (IPA) to match the IPA transcription of an out-of-vocabulary word to the most similar IPA transcription of an in-vocabulary word.

Screenshot 2024-08-13 at 10 36 10 AM
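The idea of comparing IPA transcriptions can be sketched as follows. This is an illustrative approximation only: the `IPA` table holds simplified, hand-written transcriptions, the function names are hypothetical, and the standard-library `difflib.SequenceMatcher` ratio stands in for whatever similarity measure Namesake actually uses.

```python
from difflib import SequenceMatcher

# Simplified, hand-written IPA transcriptions for illustration only.
# This table is an assumption; a real tool would use a pronunciation dictionary.
IPA = {
    "pare": "pɛr",
    "pair": "pɛr",
    "pear": "pɛr",
    "err": "ɛr",
    "peer": "pɪr",
}


def ipa_similarity(first_ipa: str, second_ipa: str) -> float:
    """Return a 0.0 to 1.0 similarity ratio between two IPA strings."""
    return SequenceMatcher(None, first_ipa, second_ipa).ratio()


def phonological_score(first: str, second: str) -> float:
    """Score two in-vocabulary words from 0 to 100 by comparing their IPA."""
    return round(100 * ipa_similarity(IPA[first], IPA[second]), 1)


def closest_in_vocabulary(ipa: str) -> str:
    """Return the in-vocabulary word whose IPA is most similar to `ipa`.

    This mirrors the lookup for out-of-vocabulary words: given some
    transcription, find the nearest known pronunciation.
    """
    return max(IPA, key=lambda word: ipa_similarity(ipa, IPA[word]))


print(phonological_score("pare", "pair"))
print(phonological_score("pare", "peer"))
print(closest_in_vocabulary("pɪər"))
```

Homophones such as "pare" and "pair" share a transcription and therefore receive the maximum score, while words that differ in a vowel score lower.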

The output below is an example of a Namesake warning that two identifiers within a code snippet are similar.

Screenshot 2024-08-13 at 10 36 32 AM

📊 Data Collection & Results

In a survey context, we screened the skill level of local programmers and presented them with a 12-question quiz based on code snippet prompts.

Screenshot 2024-08-13 at 11 28 45 AM Screenshot 2024-08-13 at 11 29 02 AM Screenshot 2024-08-13 at 11 29 50 AM

Using buggy code snippets containing commonly-confused identifiers, we assessed the debugging ability of programmers as follows:

Screenshot 2024-08-13 at 10 16 19 AM Screenshot 2024-08-13 at 10 16 40 AM

⚙️ Installing Namesake:

First, install the requirements:

pip install -r requirements.txt

🚀 Running Namesake:

To run Namesake on the file test1.py (with optional similarity thresholds):

python namesake.py test1.py [orth_threshold] [phon_threshold] [sem_threshold]

Threshold values must be between 0 and 1.
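If you want to call Namesake from another script, a small wrapper can validate the thresholds before building the command line shown above. This is a sketch under assumptions: `build_namesake_command` and `run_namesake` are hypothetical helpers that are not part of Namesake, and the sketch assumes that the positional thresholds may only be omitted from the right.

```python
import subprocess
import sys


def build_namesake_command(source_file, orth=None, phon=None, sem=None):
    """Build the argument list for `python namesake.py <file> [orth] [phon] [sem]`.

    Hypothetical helper: checks that each supplied threshold is between
    0 and 1 before the command is run.
    """
    thresholds = [orth, phon, sem]
    given = [value for value in thresholds if value is not None]
    # The thresholds are positional, so (by assumption) a later one
    # cannot be supplied unless the earlier ones are supplied too.
    if thresholds[:len(given)] != given:
        raise ValueError("thresholds must be supplied in order: orth, phon, sem")
    for value in given:
        if not 0 <= value <= 1:
            raise ValueError(f"threshold {value} is not between 0 and 1")
    return [sys.executable, "namesake.py", source_file, *map(str, given)]


def run_namesake(source_file, orth=None, phon=None, sem=None):
    """Run Namesake from the current directory and return its captured output."""
    command = build_namesake_command(source_file, orth, phon, sem)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Show the command that would be executed, without running it.
print(build_namesake_command("test1.py", 0.5, 0.5, 0.5))
```

Calling `run_namesake("test1.py", 0.5, 0.5, 0.5)` from the directory containing `namesake.py` would then execute the tool and return its printed warnings.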

👀 Example Running Namesake:

Screenshot 2024-08-13 at 10 12 31 AM

📝 Citation:

Naser Al Madi. 2022. Namesake: A Checker of Lexical Similarity in Identifier Names. In Proceedings of The 37th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW 2022).

⚖️ License:

MIT (Free Software, Hell Yeah!)
