IdentifierSimilarity

IdentifierSimilarity is an extension of Namesake: A Checker of Lexical Similarity in Identifier Names. In this project, Namesake is used to score pairs of identifiers for similarity in order to assess lexical similarity in code and, in a survey setting, how that similarity affects the ease with which developers debug code.

Namesake is an open-source tool for assessing confusing naming combinations in Python programs. Namesake flags confusing identifier naming combinations that are similar in:

orthography (word form)
phonology (pronunciation)
or semantics (meaning)

💡 What is Lexical Similarity in Code?

Lexical access describes the retrieval of word shape (orthography), pronunciation (phonology), and meaning (semantics) from memory during reading for comprehension.

Orthographic similarity focuses on similarity in word form at the level of letters. Not to be confused with edit distance (Levenshtein distance), which counts edits such as replacing one letter with another, orthographic similarity focuses on the similarity between letter shapes. A good example is the confusion between 'O' and 'C', as individual letters or within words and sentences. Here's a common example in code:

Survey participants were presented with either code snippet A (left), which contains the orthographically dissimilar identifiers "l" and "u", or code snippet B (right), which contains the orthographically similar identifiers "x" and "y".
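To make the idea of letter-shape similarity concrete, here is a minimal sketch. It is not Namesake's scoring algorithm: the CONFUSABLE table and the functions looks_alike and orthographic_score are hypothetical names invented for illustration, and the look-alike pairs are hand-picked assumptions.

```python
# Illustrative sketch only: this is NOT Namesake's scoring algorithm.
# CONFUSABLE is a hypothetical, hand-picked table of look-alike characters.
CONFUSABLE = {
    frozenset("OC"),
    frozenset("O0"),
    frozenset("l1"),
    frozenset("lI"),
}


def looks_alike(a: str, b: str) -> bool:
    """Return True if two characters are identical or listed as confusable."""
    return a == b or frozenset((a, b)) in CONFUSABLE


def orthographic_score(name_a: str, name_b: str) -> int:
    """Score 0-100: share of aligned positions whose characters look alike."""
    longest = max(len(name_a), len(name_b))
    if longest == 0:
        return 100  # two empty names are trivially identical
    matches = sum(looks_alike(a, b) for a, b in zip(name_a, name_b))
    return round(100 * matches / longest)


print(orthographic_score("total0", "totalO"))  # every position looks alike
print(orthographic_score("count", "index"))    # no position looks alike
```

Unlike edit distance, this sketch treats "total0" and "totalO" as a perfect match because the differing characters are visually confusable, not because they are the same letter.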

Phonological similarity describes two words that share a similar or identical pronunciation, also known as homophones:

Survey participants were presented with either code snippet A (left), which contains the phonologically dissimilar identifiers "pt" and "err", or code snippet B (right), which contains the phonologically similar identifiers "pare" and "pair".
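A homophone check can be sketched as a lookup of each word's pronunciation followed by a comparison. The PRONUNCIATIONS table below is a tiny hand-written stand-in for a real pronunciation dictionary, with approximate IPA transcriptions; both it and the function are_homophones are assumptions for illustration, not part of Namesake.

```python
# Illustrative sketch only: PRONUNCIATIONS is a tiny hand-written table
# (approximate IPA) standing in for a real pronunciation dictionary.
PRONUNCIATIONS = {
    "pair": "pɛr",
    "pare": "pɛr",
    "pear": "pɛr",
    "write": "raɪt",
    "right": "raɪt",
    "count": "kaʊnt",
}


def are_homophones(word_a: str, word_b: str) -> bool:
    """Return True if both words are known and share a pronunciation."""
    ipa_a = PRONUNCIATIONS.get(word_a.lower())
    ipa_b = PRONUNCIATIONS.get(word_b.lower())
    if ipa_a is None or ipa_b is None:
        return False  # out-of-vocabulary words cannot be compared here
    return ipa_a == ipa_b


print(are_homophones("pare", "pair"))
print(are_homophones("pt", "err"))
```

In this sketch, abbreviations such as "pt" and "err" are simply treated as unknown, so they are never reported as homophones.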

Semantic similarity describes words that share a meaning (synonyms):

Survey participants were presented with either code snippet A (left), which contains the semantically dissimilar identifiers "element" and "var", or code snippet B (right), which contains the semantically similar identifiers "short_string" and "little_string".
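One way to picture a semantic comparison is to split each identifier into words and check whether the corresponding words are synonyms. The sketch below assumes snake_case identifiers and uses a hypothetical, hand-written SYNONYM_GROUPS list; it does not reflect how Namesake measures meaning.

```python
# Illustrative sketch only: SYNONYM_GROUPS is a hypothetical hand-written
# list, and snake_case identifiers are assumed.
SYNONYM_GROUPS = [
    {"short", "little"},
    {"begin", "start"},
    {"end", "finish"},
]


def split_identifier(name: str) -> list[str]:
    """Split a snake_case identifier into lowercase word tokens."""
    return name.lower().split("_")


def tokens_match(a: str, b: str) -> bool:
    """Two tokens match if they are equal or belong to one synonym group."""
    return a == b or any(a in group and b in group for group in SYNONYM_GROUPS)


def semantically_similar(name_a: str, name_b: str) -> bool:
    """Return True if the identifiers match token by token."""
    tokens_a = split_identifier(name_a)
    tokens_b = split_identifier(name_b)
    return len(tokens_a) == len(tokens_b) and all(
        tokens_match(a, b) for a, b in zip(tokens_a, tokens_b)
    )


print(semantically_similar("short_string", "little_string"))
print(semantically_similar("element", "var"))
```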

Note that each type of similarity has four possible code snippet versions, any of which may be presented to a given participant in the survey.

❓ How are Identifiers Scored for Similarity?

Identifier pairs are awarded a score from 0 to 100 based on how similar they are on an orthographic (look), phonological (sound), or semantic (meaning) basis.

An example of scoring for phonological similarity: the International Phonetic Alphabet (IPA) transcription of an out-of-vocabulary word is compared against the IPA transcriptions of in-vocabulary words to find the most similar one.

The output below is an example of a Namesake warning that two identifiers within a code snippet are similar.

📊 Data Collection & Results

In a survey context, we screened the skill level of local programmers and presented them with a 12-question quiz based on code snippet prompts.

Using buggy code snippets containing commonly-confused identifiers, we assessed the debugging ability of programmers as follows:

⚙️ Installing Namesake:

First, install the requirements:

pip install -r requirements.txt

🚀 Running Namesake:

To run Namesake on the file test1.py (with optional similarity thresholds):

python namesake.py test1.py [orth_threshold] [phon_threshold] [sem_threshold]

Threshold values must be between 0 and 1.
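When calling Namesake from a script, it can help to validate the thresholds before launching it. The helper below is a sketch, not part of Namesake: build_namesake_command is a hypothetical name, and the code assumes the bounds 0 and 1 are inclusive and that the thresholds are positional, so a later one cannot be given if an earlier one is omitted.

```python
# Illustrative sketch only: build_namesake_command is a hypothetical helper.
# Assumptions: bounds are inclusive, and thresholds are positional in the
# order orth_threshold, phon_threshold, sem_threshold.
def build_namesake_command(target, orth=None, phon=None, sem=None):
    """Build the argument list for: python namesake.py target [thresholds]."""
    command = ["python", "namesake.py", target]
    omitted = False
    for name, value in (
        ("orth_threshold", orth),
        ("phon_threshold", phon),
        ("sem_threshold", sem),
    ):
        if value is None:
            omitted = True
            continue
        if omitted:
            raise ValueError(f"{name} given after an omitted threshold")
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
        command.append(str(value))
    return command


print(build_namesake_command("test1.py", 0.5, 0.7, 0.9))
```

The returned list can be passed to subprocess.run(command, check=True) from the directory that contains namesake.py.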

👀 Example Running Namesake:

📝 Citation:

Naser Al Madi. 2022. Namesake: A Checker of Lexical Similarity in Identifier Names. In Proceedings of The 37th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW 2022).

⚖️ License:

MIT (Free Software, Hell Yeah!)

