Much social science research involves capturing and analyzing data. Normally, data will be placed in a machine-readable file via an application such as Excel, JMP, SPSS, or SAS. Data may be entered directly in digital form or first recorded on a paper form and entered later.
The nature of coding will vary with the sort of data that you have. Interval-ratio data are typically the easiest because numbers are entered as they are; e.g., an age value of 64 is entered as 64. Nominal and ordinal values require more thought. For example, a study of male and female library science students would likely include a variable "sex." Values would be male and female. Additional values might be added for "unknown" or "missing." Handling the data would be easier if you did not enter "male" or "female" onto your form. You might decide to use "m" or "f" instead. If you do, you have just made a coding decision.
Coding involves two decisions:
Traditionally, computers manipulated numeric values more quickly than alphabetical (string) values, so numbers were often used to represent values. However, this means that you need to be careful not to forget that 1 for female is a label and not a real number. Using alphabetical values should prevent the erroneous use of nominal data, and most contemporary statistical software packages will handle them with ease.
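As a small illustration of why a numeric code is only a label, the sketch below tallies a nominal variable by counting how often each code occurs rather than doing arithmetic on it. The codes (1 = female, 2 = male) and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical codes for the nominal variable "sex": 1 = female, 2 = male.
sex_codes = [1, 2, 1, 1, 2, 1]

# Counting is meaningful for nominal data: how many cases carry each code.
frequencies = Counter(sex_codes)

# Arithmetic is not: this "mean sex" can be computed but has no interpretation.
meaningless_mean = sum(sex_codes) / len(sex_codes)

print("female:", frequencies[1], "male:", frequencies[2])
```

The software will happily compute the mean, which is exactly why the coder has to remember that the numbers are labels.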
A code book is a complete guide to coding. It clearly shows which characters are associated with which values for each variable being studied. Without the code book, you or another person using your data set might not know what a 1 for the variable "sex" represents. If the code book is given to someone else, that person should be able to code exactly as you would. Information needed to code accurately and quickly might include operational definitions, examples, and precedents established while coding.
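A code book can also be kept in machine-readable form alongside the data. The following is a minimal sketch, assuming a single variable "sex" coded with the letters "f", "m", and "u"; the names CODE_BOOK and decode are hypothetical and chosen only for illustration.

```python
# Hypothetical code book: for each variable, which code stands for which value.
CODE_BOOK = {
    "sex": {"f": "female", "m": "male", "u": "unknown"},
}


def decode(variable, code):
    """Return the value label for a code, or raise if the code book lacks it."""
    try:
        return CODE_BOOK[variable][code]
    except KeyError:
        raise ValueError(
            f"no entry for variable {variable!r}, code {code!r} in the code book"
        ) from None


labels = [decode("sex", c) for c in ["f", "m", "u"]]
print(labels)
```

Rejecting codes that are not in the code book, rather than passing them through silently, catches typing errors at the time of entry.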
A case represents one unit of whatever is being sampled. In bibliometric research, a case might be one bibliographic citation. In survey research of library users, a case might be one person interviewed. Depending on how many variables are being studied, a case might occupy a few rows in a spreadsheet or many.
You will decide on the number of variables to be examined. Each variable will have two or more values and you will select those with some thought, considering the population as well as your time and effort.
Characters need to be allocated for missing values, but those characters should be clearly different so that they are not included in any statistical analysis where they should be excluded. Most statistical software allows you to identify and exclude missing values from analysis.
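To show why missing-value codes must be kept out of the analysis, here is a minimal sketch that computes a mean age with and without them. It assumes a hypothetical convention in which 0 marks a missing age, which works only because no respondent can actually be 0 years old.

```python
# Hypothetical convention: 0 marks a missing age (not a legitimate value here).
MISSING_AGE = 0

ages = [64, 23, MISSING_AGE, 41, MISSING_AGE, 32]

# Drop the missing-value code before computing any statistic.
valid_ages = [a for a in ages if a != MISSING_AGE]
mean_age = sum(valid_ages) / len(valid_ages)

# Leaving the codes in treats them as real ages and drags the mean down.
biased_mean = sum(ages) / len(ages)

print(mean_age, biased_mean)
```

Statistical packages typically do the same filtering once a code has been declared as missing for a variable.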
Traditionally, demographic information, that is, information that identifies the unit measured and provides basic descriptive information, should come first. An example for a person surveyed might be:
An example for a bibliographic citation might be:
Software varies in how it handles blank spaces. It is usually better to use a 0 rather than a blank or another character to indicate a missing value, provided that 0 is not a legitimate value for that variable.
Reliability is the degree to which different coders would yield the same result in coding the same data or the degree to which the same coder would yield the same result in coding the same data at different times.
Without reliability, coding is of little value: "Garbage in, garbage out."
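One simple way to put a number on reliability is percent agreement: the share of cases that two codings record identically. The sketch below assumes two lists of codes for the same ten cases in the same order; the function name percent_agreement and the sample codings are hypothetical.

```python
def percent_agreement(coder_a, coder_b):
    """Percentage (0-100) of cases that two codings recorded identically."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both codings must cover the same non-empty set of cases")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)


# Hypothetical codings of the variable "sex" for the same ten cases.
first = ["f", "m", "m", "f", "u", "f", "m", "f", "f", "m"]
second = ["f", "m", "m", "f", "f", "f", "m", "f", "f", "m"]

agreement = percent_agreement(first, second)

# Positions where the codings differ, so each disagreement can be reviewed.
disagreements = [i for i, (a, b) in enumerate(zip(first, second)) if a != b]

print(agreement, disagreements)
```

Percent agreement does not correct for agreement that would occur by chance; statistics such as Cohen's kappa are designed for that, at the cost of being less intuitive.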
You should be confident that your data collection and coding are reliable. Have another person code a small but typical sample of your data and compare results. The reliability should be better than 95 percent. Do this before you code your 3799 cases. Questions to ask as you compare the results of the two codings: