First Try at Data Munging

I’ve been taking the Udacity course Exploratory Data Analysis and decided I wanted to try my hand at a real data set that I cared about. I ran into several obstacles that are probably common, and I hope this writeup will help someone else.

The data I cared about was in SQL Server, so the first step was to get it out:

bcp "select .. from .. where .." queryout data.dat -c -t"||||" -S server -U user -P pass

I chose “||||” as my delimiter because I was fairly sure no value contained four consecutive pipe characters. It’s much easier to search for a good single-character delimiter once the data is in a text file: after the export, I searched through data.dat, found no asterisks anywhere, and replaced every “||||” with “*”:

sed -i 's/||||/*/g' data.dat
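If you prefer to do that delimiter check from R instead of a text editor, here is a minimal sketch. It reads the file as raw bytes, which sidesteps the embedded null characters that show up later in this story:

bytes <- readBin("data.dat", what = "raw", n = file.info("data.dat")$size)
sum(bytes == charToRaw("*"))  # 0 means "*" is safe to use as the new delimiter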

I tried to load this into R with mydata <- read.csv("data.dat", sep="*") but ran into a problem:

Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 2 appears to contain embedded nulls

I eventually realized that anything that was either NULL or an empty string in the SQL Server database comes out as 0x00, a binary null character. Emacs displays a binary null as ^@, so I replaced them with ‘NA’ via M-x replace-string RET ^@ RET NA RET (you can type the ^@ itself with C-q C-@). As a side note, if you position the cursor on a character you want to know about and run M-x describe-char, Emacs will tell you a lot about it. If you haven’t yet experienced the life- and file-altering wonders of Emacs, the same replacement works in sed:

sed -i 's/\x0/NA/g' data.dat
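The same raw-byte check from earlier verifies that the replacement actually worked:

bytes <- readBin("data.dat", what = "raw", n = file.info("data.dat")$size)
any(bytes == as.raw(0))  # FALSE means no null bytes remain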

Now read.csv seemed to work without errors, but I noticed that the number of ‘observations’ R thinks are in the file (dim(mydata)) was not the same as the number of lines in the file, so I knew something was wrong. To count the lines in a file you can run wc -l data.dat in the terminal.
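You can make the same comparison without leaving R; if the file had a header row, the two numbers would differ by one:

length(readLines("data.dat"))  # physical lines in the file
nrow(mydata)                   # rows R actually parsed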

It took me quite some time to figure it out. The following finally worked correctly:

mydata <- read.table("data.dat", na.strings=c("", "NA"), sep="*", comment.char="", quote="")

?read.csv reveals that it actually calls read.table internally and fills in some defaults for you. One of those defaults is sep=",", but we had already overridden that. The ones that got me were comment.char and quote. read.csv sets comment.char to "", which disables commenting altogether (good, for my data), but read.table defaults it to "#". Additionally, read.csv sets quote="\"", while read.table defaults quote to "\"'". After switching from read.csv to read.table, I initially started getting errors like this:
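You can see the relationship for yourself by typing read.csv at the R prompt without parentheses; in the versions of R I’ve used it is essentially this thin wrapper around read.table:

function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
    fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
    dec = dec, fill = fill, comment.char = comment.char, ...)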

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 9237 did not have 8 elements

I checked the line it complained about, but it had 8 elements. I know that sometimes the real problem occurs earlier than the line the error message points at, so as a sanity check I wrote this quick little ditty in Python to count the elements on each line:

#!/usr/bin/env python

# Collect the line numbers of any lines that don't split into exactly 8 fields.
badlines = []

with open('data.dat', 'r') as data:
    for linenum, line in enumerate(data, start=1):
        fields = line.split('*')
        if len(fields) != 8:
            badlines.append(linenum)

print(badlines)

However, this came back with an empty list, so I knew something else was going on. Once I took a closer look at the documentation and set quote="", disabling quoting altogether, I finally had no errors and the correct number of observations.
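In hindsight, the failure mode makes sense: with quoting enabled, a stray quote character in a field makes R read everything up to the next quote, newlines included, as a single field, silently merging lines. A minimal sketch with a hypothetical two-line file:

writeLines(c('a*b*"c', 'd*e*f'), "demo.dat")
read.table("demo.dat", sep = "*")              # quoting on: warns, merges both lines into one row
read.table("demo.dat", sep = "*", quote = "")  # quoting off: two clean rows of 3 fields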

Also, while in the help page for read.table/read.csv, I found that na.strings tells R which strings to interpret as missing values. Setting na.strings=c("", "NA") tells R to treat both blank fields and the literal string "NA" as NA.
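A tiny illustration, using read.table’s text argument in place of a file:

read.table(text = "1*NA*x\n2**y", sep = "*", na.strings = c("", "NA"))
#   V1 V2 V3
# 1  1 NA  x
# 2  2 NA  y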

There's more data manipulation I may need to do, but for now I can finally start looking at the data.
