Naming variables (particularly in Stata) | Advice to doctoral students

A consistent scheme for naming your variables is very helpful. It makes coming back to a project after it’s been under review for three months much easier, and it is especially valuable when collaborating with someone else. This is one of those points where there are bad practices and good practices, but no "right" practice. More important is consistency within (and ideally across) projects. So, as a starting point for your consideration, here is what I have developed over time, through lots of trial and error. I think this approach makes it easy to find variables and understand their provenance.

We're aiming for a balance between names that are quick and easy to type and names that are easy to understand. Remember that Stata allows us to use descriptive labels for output, meaning the reader won't ever see our variable names, so don't call a variable "theYearThatYourParentsEnteredTheCountry". Something like yearParentsEntered is sufficiently descriptive. On the other hand, ype is probably going too far in abbreviation.
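As a minimal sketch of pairing a short name with a descriptive label (the toy data and the label text are made up for illustration):

```stata
* Toy data, purely illustrative
clear
input id yearParentsEntered
1 1985
2 1992
3 2001
end

* Short name for typing, long label for the reader
label variable yearParentsEntered "Year parents entered the country"

* describe shows the name alongside its label
describe yearParentsEntered
```

The name stays short enough to type and auto-complete, while the label carries the full description.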

Capitalization. My strongest feeling here is to avoid ALLCAPS names. They are hard to read, hard to type, and scream 1990s software. Favor, instead, lower case. Well, actually, what's called "Camel case" by some, since it has "humps" in the middle. For example, workEthic. I find that relatively easy to read and type.

Modified variables. Indications of modifications of variables go at the end. This has several advantages. If we want to describe all versions of the workEthic variable, we can use the command "describe workEthic*". We can also take advantage of auto-complete more easily. Examples to follow include:

workEthicSq. The square of workEthic.
workEthicCu. The cube of workEthic.
workEthicC. The centered version of workEthic.
workEthicSt. The standardized (mean 0, sd 1) version of workEthic.
workEthic and workEthic2. Suppose we have two ways we've measured work ethic. Perhaps the first is based on the average of three variables and the second is based on the average of just two of those variables. Name one of those workEthic and the other workEthic2. A third version would be, of course, workEthic3. We want to avoid the confusion that results from having multiple versions of a variable with the same name. We don't want to wonder which workEthic was used in a given regression.

Add indicators of modification in the order in which the modifications were made, from left to right. So, workEthic2CSq means we took the second version of workEthic, centered it, and then squared that. workEthic2SqC means we took the second version of workEthic, squared that, and then centered the result (which is a really weird thing to do). workEthicC and workEthic2C are the results of centering the first and second versions of workEthic, respectively.
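Here is one way the suffix scheme might look in a do file. This is a sketch on made-up data (workEthic takes the values 1 through 5); the commands are standard Stata, but the particular way of centering is just one option.

```stata
* Toy data, purely illustrative: workEthic = 1, 2, 3, 4, 5
clear
set obs 5
generate double workEthic = _n

* Centered version: subtract the sample mean
summarize workEthic, meanonly
generate double workEthicC = workEthic - r(mean)

* Standardized version (mean 0, sd 1)
egen double workEthicSt = std(workEthic)

* Square and cube of the original
generate double workEthicSq = workEthic^2
generate double workEthicCu = workEthic^3

* Centered, then squared: suffixes read left to right
generate double workEthicCSq = workEthicC^2

* Every version shares the prefix, so one wildcard finds them all
describe workEthic*
```

Because the modifiers trail the base name, the final describe lists the original and every derived version together.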

yr2000. Stata doesn't allow variable names to start with a digit. This is most often a problem when raw data has observations labeled by year, e.g., "2000". Renaming by prepending "yr" is helpful when it comes time to reshape the data, as it allows us to refer to the stub "yr". If we just prefixed the variable with "y" (as I used to, sigh) and we also had a variable named youngKids, we would have to take extra steps to work around it. Very few other variable names start with "yr".
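A small sketch of the reshape step with the "yr" stub. The data are invented, and income is a hypothetical name for whatever the yearly columns measure.

```stata
* Toy wide data, purely illustrative
clear
input id youngKids yr2000 yr2001 yr2002
1 0 10 11 12
2 2 20 21 22
end

* The stub "yr" picks up yr2000-yr2002; youngKids is carried along unchanged
reshape long yr, i(id) j(year)

* Give the reshaped measure a meaningful name (hypothetical)
rename yr income

list, sepby(id)
```

After the reshape there is one row per id and year, with the year stored in the new variable year.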

Some characters that other programs allow in variable names don't work well, or at all, in Stata. These include ampersands, dashes, hash marks (#), periods, commas, parentheses, question marks, exclamation marks, asterisks, and probably just about any other punctuation.

Just FYI, here's why I now avoid several things I used to do. I no longer use underscores, e.g., work_ethic, because they add lots of typing and add length to variable names without adding information or much readability over camelCase. I don't use the number "2" to indicate squared variables, e.g., workEthic_2, because it can be confused with the version of the variable. Lastly, I try to avoid descriptors such as "original", "new", "old", "main", "regular", "alternative", etc., because they aren't easy to standardize and tend to pile up over time. Which came first, workEthicNew or workEthicAlternative? Did workEthicOriginal come as is from the raw data or was it our first attempt at forming the measure? Etc.

You will often get data in which these norms have not been followed. Actually, I've never gotten raw data in which they have all been followed. It's not worth renaming all of the original variables, especially if a lot of them won't be used. My preference would be to add lines to the variable manipulation version of the do file if and when it becomes obvious that we'll use a variable. So, if we're going to use var1 a lot, it's worth doing

clonevar yearParentsEntered = var1

and subsequently using yearParentsEntered. If you group those together in the code, it's pretty easy to track down the origin of any variable.
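A grouped block of this kind might look like the sketch below. The raw names var1 through var3 and the new names for var2 and var3 are hypothetical, chosen only to show the layout.

```stata
* Toy raw data, purely illustrative
clear
input var1 var2 var3
1985 3 40
1992 1 35
end

* Renamed copies of raw variables, kept together in one place.
* clonevar copies values, labels, and formats; the originals stay untouched.
clonevar yearParentsEntered = var1
clonevar numSiblings        = var2
clonevar hoursWorked        = var3

describe var1 yearParentsEntered
```

With every copy made in one block, a search of the do file for a new name leads straight to the raw variable it came from.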

On the other hand, if you are going to average ten measures called Var1-Var10 to generate the workEthic variable and then just use that, I probably wouldn't bother renaming the original variables either to the camelCased var1 or to the more informative finishJobImportant, lazyPeopleBad, etc. It's just not worth it when we can simply do egen workEthic = rowmean(Var1-Var10).
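A runnable sketch of that shortcut, using placeholder values for the ten items. It uses egen's rowmean() function, which older Stata code spells rmean(); the label text is an assumption.

```stata
* Toy data, purely illustrative: Var1 = 1, Var2 = 2, ..., Var10 = 10
clear
set obs 4
forvalues k = 1/10 {
    generate Var`k' = `k'
}

* Row mean of the ten raw items; missing values are ignored by rowmean()
egen workEthic = rowmean(Var1-Var10)
label variable workEthic "Work ethic (mean of Var1-Var10)"

summarize workEthic
```

Note that the range Var1-Var10 follows the order of variables in the dataset, so it is worth checking that order with describe before relying on it.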