Jonathan Munene Data Engineer Lead Consultant - SwissPeaks East Africa [R, Shiny, Tableau, XLSForms, Stata, SPSS, ODK, SurveyToGo, Data Science & Machine Learning].

SPSS Define new variables in empty data set

The Problem:

When working with SPSS syntax, you may want to create a new, empty data file, i.e. one that has variable definitions but no records. If you try the usual COMPUTE, STRING, or NUMERIC Var1 (F1.0) commands before any data exist, you will get an error:

Error # 100.  Command name: COMPUTE
This command is not permitted before the beginning of file definition
commands.
This command not executed.

The workaround for this is:

NEW FILE. 
INPUT PROGRAM. 
NUMERIC v1 (f2) v2 (f2) v3 (f2). 
STRING str1 (a8). 
NUMERIC v4 (f3) v5 (f5). 
END FILE. 
END INPUT PROGRAM. 
EXECUTE. 

or

DATA LIST FREE /VAR1 to VAR100.
FORMATS VAR1 TO VAR100 (F1.0).
BEGIN DATA.
END DATA.
EXECUTE.

Digital questionnaires scripting pro tips

When working on evaluations of development projects, market research, or any other form of data collection, it is extremely important to get high-quality data from the point of collection, which is made possible by scripting the tools correctly.

Start with a paper questionnaire

It is always useful to start with a paper questionnaire. Once you have scripted (or programmed) a relatively final draft of your paper questionnaire, you can make modifications directly in the script; it is advisable to update the paper questionnaire simultaneously, or to keep notes of your changes.

Quick Scripting tips

  1. Create pre-coded options or a finite range of responses wherever possible.
  2. If you plan to use Stata/SPSS, keep your field names lowercase and 32 characters or shorter.
  3. Create checks and balances—like prompts that help guard against illogical responses—in the questionnaire wherever possible.
  4. Think through how you want your data to look (sometimes sketching it out can help) and then work backwards.
  5. Avoid overlapping repeat groups. Make sure that groups are completely nested within one another.
  6. When writing long, complicated code such as constraints or relevance conditions, it is easier to first write the code in a text editor (I like Notepad++), and then copy it into Excel or your scripting studio.
  7. Script in ONE language first; once the tool has been tested, you can go ahead and add the translations.
  8. Write the questions in a way that is easily read and understood.
  9. Break up large repeat groups, or questions with long choice lists, into smaller ones.
  10. Test smaller bits of your script as you keep building up.
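Tip 2 can be checked mechanically before import. Here is a minimal sketch in JavaScript (the function name and the sample field names are mine, for illustration) that flags field names Stata/SPSS would likely complain about:

```javascript
// Flag questionnaire field names that may cause trouble in Stata/SPSS:
// names should be lowercase and at most 32 characters long.
function badFieldNames(names, maxLen) {
  maxLen = maxLen || 32;
  var problems = [];
  names.forEach(function (name) {
    var issues = [];
    if (name.length > maxLen) issues.push("longer than " + maxLen + " characters");
    if (name !== name.toLowerCase()) issues.push("contains uppercase letters");
    if (issues.length) problems.push({ name: name, why: issues.join("; ") });
  });
  return problems;
}

var report = badFieldNames(["hhid", "RespondentName", "a_very_long_variable_name_over_32_chars"]);
report.forEach(function (p) { console.log(p.name + " -> " + p.why); });
```

Running a check like this over the `name` column of your XLSForm before the first export saves a rename-and-relabel round later.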

Test the questionnaire

  • Always test your questionnaire before training or rollout to make sure it works without errors.
  • Don’t rely only on the “Preview” function of your scripting system; also try the questionnaire out on the tablet/phone that will actually be used in the field. Some of the tablets/phones scripters use for testing may be more advanced, or have bigger screens, than the devices enumerators will carry.
  • In addition to making sure that the questionnaire runs smoothly, export the data and look through it to make sure that your data is in the format you want it in.
  • Enumerators are also good script testers.
  • Finally, use real-world respondents to test the questionnaire.

While this may seem like a lot of work, it is always beneficial when the same person who scripts the questionnaire also trains the enumerators and cleans the data. This way, the data is in a format you are comfortable working in, and if the data are a mess you only have yourself to blame!

SPSS working directory

The Problem:

When working with SPSS syntax, any command that opens or saves a file (GET FILE, GET DATA, WRITE OUTFILE, SAVE OUTFILE) needs the full path where you want the file saved or retrieved from. For instance, if I want to save Vars1 to Vars8 of my SPSS file in my Documents folder, I’d type: SAVE OUTFILE="C:\Users\User\Documents\Vars1_Vars8.sav" /COMPRESSED KEEP=Vars1 TO Vars8. If my path is long, or I have to keep saving and getting files, I can change my working directory in SPSS using the CD command; from there on I only have to worry about the file names. To achieve this:

  • At the beginning of your syntax file, type: CD "C:\Users\User\Documents".
  • To save the same Vars1 to Vars8, the syntax now becomes: SAVE OUTFILE="Vars1_Vars8.sav" /COMPRESSED KEEP=Vars1 TO Vars8.

Over the years, this has saved me a ton of keystrokes, and the worry of whether I have the right path.

How to filter rows in Excel that have count greater than 1

The Problem:

Assuming I have an Excel workbook of some thousand rows with a unique ID column, and I would like to filter out rows where the ID has been duplicated and show how many times each ID appears. To achieve this:

  • At the end of my file, I’ll add a column named Count.
  • Assuming my ID column is column B, I’ll add this formula to my Count column and drag it down: =COUNTIF($B$2:$B$28,B2). Make sure you edit the range to match the size of your file, i.e. the number of rows; mine goes up to row 28.
  • Each row will now have a value of the number of times the ID value has been repeated, under the Count column.
  • I can now use the normal filter and select all values greater than 1.
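The same per-row count can be reproduced outside Excel. Here is a minimal sketch in JavaScript (the sample IDs are made up for illustration) that mirrors what the COUNTIF column computes and then applies the “greater than 1” filter:

```javascript
// IDs as they would appear in column B, duplicates included.
var ids = ["A101", "A102", "A103", "A102", "A104", "A103", "A102"];

// Count each ID's total occurrences, like =COUNTIF($B$2:$B$28,B2).
var counts = {};
ids.forEach(function (id) { counts[id] = (counts[id] || 0) + 1; });

// The "Count" column: one entry per row.
var perRow = ids.map(function (id) { return { id: id, count: counts[id] }; });

// Keep only rows whose ID appears more than once (filter > 1).
var duplicates = perRow.filter(function (row) { return row.count > 1; });
duplicates.forEach(function (row) { console.log(row.id + " appears " + row.count + " times"); });
```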

How to convert a string variable to numeric/categorical in SPSS

The Problem:

Assuming you collected data for a variable as a string, or you imported CSV/Excel data into SPSS: categorical data is imported as string, not as numeric with value labels. This makes it hard to carry out several kinds of analysis. To fix this:

  • Create a new numeric variable, and add the value labels.
  • Create IF statements that assign each string value to the correct code in the new numeric variable.

In my case, QRC1 is a string variable holding days of the week.


* Create numeric QRC1 - QRC1_Num.

NUMERIC QRC1_Num (f2.0).
EXECUTE.
VARIABLE LABELS
QRC1_Num "Recoded QRC1".
EXECUTE.
VALUE LABELS 
QRC1_Num
1 "Monday"
2 "Tuesday"
3 "Wednesday"
4 "Thursday"
5 "Friday"
6 "Saturday"
7 "Sunday".
EXECUTE.

* Assign QRC1 string into QRC1_Num.
IF (QRC1 = "Monday") QRC1_Num=1.
IF (QRC1 = "Tuesday") QRC1_Num=2.
IF (QRC1 = "Wednesday") QRC1_Num=3.
IF (QRC1 = "Thursday") QRC1_Num=4.
IF (QRC1 = "Friday") QRC1_Num=5.
IF (QRC1 = "Saturday") QRC1_Num=6.
IF (QRC1 = "Sunday") QRC1_Num=7.
EXECUTE.

How to check for duplicate values in a column in a dataframe in R

The Problem:

Assuming you have a variable of collected unique IDs, such as household IDs (hhIDs), and you would like to see if there are any duplicates:

#The required libraries
require(tidyverse)

#Load the libraries
library(tidyverse)  

hhIDs_Dups <- 
  DF %>% 
  group_by(hhID) %>% # the group of interest (hhIDs)
  mutate(duplicate = n()) %>% # count number in each group
  filter(duplicate > 1) %>% # select only duplicated records
  select(-duplicate) # remove group count column

If you now View(hhIDs_Dups), you will see only the duplicated records.

It’s advisable to add some metadata to your DF, like location and other respondent info, for easier identification of which record to update/change.

How to update fields in a dataframe in R

The Problem:

Assuming you have a data frame of data collected at baseline, or you have already started data collection and then change your data collection tool, for instance by adding new variables var100, var202 & var347, the following will happen:

The already collected data, let’s call it dfA, will have blanks in the newly added variables once merged with the newly collected data. When running your analysis, the outputs for the var100, var202 & var347 variables will be NULL, and thus give out the wrong numbers :(.

The Solution;

Based on the method you are using to collect your data, you can either;

  • Create a spreadsheet file with the variables that are missing data from dfA, or
  • Create a short survey that the team will use to go back to the field and fill in.

Whichever option you use, make sure that you extract all relevant metadata from dfA that will be used as the key identifier, and put it in the survey created above; let’s call it dfB.

Once dfB has been filled in, you can now use the power of R to fill in the blanks in dfA using the data collected in dfB.

#The required libraries
require(data.table)

#Load the libraries
library(data.table)  

We will now use a join from data.table: convert the data.frame to a data.table (setDT(dfA)), join with dfB on "key", and assign (:=) the values in var100, var202 and var347 from i.var100, i.var202 and i.var347.

Just run the following code

setDT(dfA)[dfB, c('var100', 'var202', 'var347') := .(i.var100, i.var202, i.var347), on = "key"]

If you now View(dfA), the data will be updated.

Introduction to Data Science in R Part I

R is a powerful language used widely for data analysis and statistical computing. It was developed in the early 90s. Since then, endless efforts have been made to improve R’s user interface. The journey of the R language, from a rudimentary text editor to the interactive RStudio and more recently Jupyter Notebooks, has engaged many data science communities across the world.

I have created the first part in a series of Introduction to Data Science in R; please click on this link: Introduction to Data Science in R Part I

Generating Random Directional Steps in SurveyToGO

How to randomly select the direction and number of steps to be followed during data collection. In case you want to specify the direction and number of steps your enumerators should follow, the following guide will help you out.

Using text fields from a repeat group as choices in your XLSForm.

I want to know how many health facilities are within the respondent’s area/village, and have a follow-up question that asks which of these facilities is closest to the respondent.

XLSForms Authoring Tips and Tricks

Expert advice on how to author XLSForms. Have you ever wondered what a good XLSForm looks like? This guide should help you author one.

Improving performance of your XLSForm

Expert advice on how to improve performance on your XLSForms. You might be able to improve the performance of your form by altering some aspects of its design. To begin with, it's helpful to understand the two key reasons why users might encounter sluggishness when filling out a form.

Getting rid of the NaN in XLSForms calculations.

How to get rid of NaN in XLSForms calculations. Sometimes it’s annoying when you want to display the value of a calculation to the enumerator but you get a NaN; this post will show you a workaround for the issue.

Generating Random Numbers in SurveyToGO

You might want to generate a random number from an array and use it later in the survey. To do this, we’ll create a global variable as follows:

var steps = RandomizeArray([20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]);

RandomizeArray shuffles the array, so steps holds the full list in random order; taking steps[0] gives a number between 20 and 50, selected at random.
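RandomizeArray is SurveyToGo’s built-in function. Outside the platform, its effect can be approximated with a standard Fisher–Yates shuffle; this is a hedged sketch (the stand-in function name is mine, not part of the SurveyToGo API):

```javascript
// A stand-in for SurveyToGo's RandomizeArray: returns a shuffled
// copy of the array, using the Fisher–Yates algorithm.
function randomizeArray(arr) {
  var copy = arr.slice();
  for (var i = copy.length - 1; i > 0; i--) {
    var j = Math.floor(Math.random() * (i + 1));
    var tmp = copy[i];
    copy[i] = copy[j];
    copy[j] = tmp;
  }
  return copy;
}

var steps = randomizeArray([20, 21, 22, 23, 24, 25]);
// steps[0] is one randomly selected number from the original list.
console.log(steps[0]);
```

Because the whole array is shuffled rather than a single value drawn, you can also consume steps[1], steps[2], … later in the survey as further non-repeating random picks.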

How to read all files in a folder and merge them into one file in R

If you have a folder with individual files and would like to merge them into one master file, you can use the following script.

#The required libraries
require(data.table)
require(readxl)

#Load the libraries
library(data.table)  
library(readxl)  

> We are using the package `readxl` because our files are Excel workbooks.

#Read the files (note: this should point to the folder where the files are located;
#full.names = TRUE is needed so read_excel receives full paths, not bare file names)
files <- list.files(path = "D:/Jmunene@Ona/xlsforms/Tz Water Points/data/57 LGAs PBR (August Update)", pattern = "\\.xls", full.names = TRUE)
temp <- lapply(files, read_excel)

#Merge the files by rows. (Note: if the files don't have identical columns, the missing columns are filled with NA)
data <- rbindlist(temp ,fill = TRUE)

#Save the file, as a csv file - but you can choose the format you want :)
write.csv(data,file = 'Merged 57 LGAs PbR.csv',row.names = FALSE)