Learning LaTeX and the result compared to Word

LaTeX is a high-quality typesetting system. It is open source and, more importantly, it is Free Software. This past weekend I decided to learn some LaTeX, as it had been an interest in the back of my mind for some time. At first I was hesitant because LaTeX was made for mathematical and scientific papers, which I don’t write. Donald Knuth’s impetus for creating the underlying language, TeX, was that there weren’t good tools for typesetting and displaying mathematical formulas. However, my concern was misguided. LaTeX can make any document look beautiful, and it can be used for any kind of article, book, or even a resume. What sets it apart from WYSIWYG editors like Microsoft Word is the sheer typographical quality of the documents you can produce. Under the hood, LaTeX’s algorithms calculate everything from line height to word and letter spacing. They can be adjusted as you like, and many packages exist that make the process easier.

My mother recently produced a book using Microsoft Word, and it was no easy task. The index-making process is difficult, headers inexplicably stop showing up correctly, page numbers stop respecting section boundaries, and blank pages pop up everywhere in the resulting PDF. Furthermore, the WYSIWYG nature of Word encourages you to fix spacing issues by hand with the wrong tools, and if you are picking up from where someone else left off, good luck reformatting everything. Even after reformatting, if text changes and pages shift, you have to redo your work. I wanted to convince myself and my mother that the book could be produced with LaTeX in one or two nights and look better than its Word counterpart. I was able to do it in just one afternoon.

First I found that LaTeX supports a book document class, which you declare in the first line of your document. However, after adding more pieces to the puzzle, I learned about the KOMA-Script package, which provides a drop-in replacement for the book (and article & report) class, packed with some additional goodies. There is also the memoir class, which looked like an interesting alternative. In the end I replaced the standard book class with KOMA-Script’s scrbook class.

\documentclass[12pt,letterpaper]{scrbook}

I found that I did not even need to install anything, as it was already included in the LaTeX distribution bundled with Ubuntu. I haven’t tried it yet, but I’ve read that MiKTeX is a good distribution for getting started with LaTeX on Windows, and that it makes it easy to install this and other useful packages. I plan on getting my mother to try MiKTeX once I show her my LaTeX version of the book (which undeniably looks better than the Word version). However, for this proof of concept, I didn’t use any editor specially designed for LaTeX, as I was of course working in EMACS. EMACS has a LaTeX mode with useful key bindings and syntax highlighting. I immediately got started copying in all the chapter titles like this:

\chapter[Optional short name for the TOC]{My Very Long Chapter Name Here}

I did not have to wrap paragraphs in any tags; you simply leave a blank line to start a new paragraph.
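For example, these lines in the source come out as two separate paragraphs:

This sentence belongs to the first paragraph.

Leaving a blank line above it starts a second paragraph.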

This book has many quotations and block quotes, and many of them were formatted improperly in Word, which doesn’t make that easy. I didn’t have to worry about any of that, as in LaTeX I am only semantically tagging them, not styling them. Styling comes later, when you’re done tagging, though I found that even the default styling was impressive. Here’s what the markup looks like:

\begin{quote}
All that is gold does not glitter,
Not all those who wander are lost;
The old that is strong does not wither,
Deep roots are not reached by the frost.

From the ashes a fire shall be woken,
A light from the shadows shall spring;
Renewed shall be blade that was broken,
The crownless again shall be king.
\end{quote}

KOMA-Script’s scrbook gives useful variations on sections and subsections, like addsec* and minisec. The * on addsec is a modifier that keeps it from appearing in the Table of Contents (TOC), while minisec produces a small heading that never appears there.

\minisec{My mini subsection name}
Blah Blah

Creating the index was refreshingly sane. I simply went to the points of interest in the text, dropped in \index{key} tags, and I was done. After that, text could be added or removed and pages could shift with no additional work, as everything is recalculated for you. Every page with the same key is listed in the index under the same entry, and runs of sequential pages are collapsed into ranges automatically. Footnotes are just as easy, but this book uses endnotes rather than footnotes. I googled for endnotes and found that there was already a package for it. Once again, I did not even have to download it, as it was already included in my LaTeX distribution. I wanted the endnote numbers to reset every chapter, as they do in the Word version, and there’s a package for that too.

\usepackage{endnotes,chngcntr}
\counterwithin*{endnote}{chapter}  % Reset endnote numbering every new chapter
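To give a sense of how those pieces fit together, here is a rough sketch of the relevant bits of the preamble and of a chapter. This is how I understand the makeidx and endnotes packages are meant to be wired up; the sentence and note text are just placeholders, not lines from the actual book.

\usepackage{makeidx}               % standard index support
\usepackage{endnotes,chngcntr}
\counterwithin*{endnote}{chapter}  % reset endnote numbering every new chapter
\makeindex

% ... somewhere in a chapter ...
The wizard\index{wizard} paused at the gate.\endnote{This text ends up at the end of the chapter.}

% ... at the end of each chapter ...
\theendnotes

% ... at the very end of the book ...
\printindex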

This is a brief overview of some of the tags I used, which I hope highlights how easy this was to do. Then I generated the document directly to PDF (the typical command sequence is sketched below). Without even thinking about styling yet, the document that came out was a typographically stunning piece of work. With a couple of easy tweaks, I purposely made it look closer to the Word document for comparison purposes, to highlight the superiority of the type produced by LaTeX. Unfortunately I can’t share the “final” proof of concept here, as it is an entire book and I don’t hold the rights to it. It would not be entirely fair for me to omit that there is a learning curve with LaTeX, of course. However, I hope this helps anyone just starting out or curious about learning LaTeX.
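For reference, “generating directly to PDF” just means a few passes from the command line. The usual sequence looks something like this, with book.tex standing in for the main file:

pdflatex book.tex
makeindex book.idx   # build the index from the \index entries
pdflatex book.tex    # run again so the index, TOC, and cross-references settle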

First try at Data Munging

I’ve been taking the Udacity course Exploratory Data Analysis and decided that I wanted to try my hand at a real data set that I cared about. I ran into several obstacles that are probably common, and I hope that this will help someone else.

The data I cared about was in SQL Server, so first I got it out:

bcp "select .. from .. where .." queryout data.dat -c -t"||||" -S server -U user -P pass

I chose “||||” as my delimiter because I was fairly sure that no value contained four pipe characters in a row. It’s much easier to search for a good delimiter once the data is in a text file. Once the data was out, I searched through data.dat, found that there were no asterisks in the entire file, and replaced every “||||” with “*” as my delimiter.

sed -i 's/||||/*/g' data.dat
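Incidentally, verifying that a candidate delimiter is safe can be as simple as a grep. The -F flag treats the pattern as a literal string and -c counts matching lines, so a result of 0 means no line contains an asterisk at all:

grep -cF '*' data.dat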

I tried to load this into R with mydata <- read.csv("data.dat", sep="*") but ran into a problem:

Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 2 appears to contain embedded nulls

I eventually realized that anything that was either NULL or an empty string in the SQL Server database comes out as 0x00, a binary null character. EMACS represents the binary null as ^@. I replaced these binary marks with ‘NA’ in EMACS with M-x replace-string ENT ^@ ENT NA ENT. As a side note, you can position the cursor on a symbol you want to know about and do M-x describe-char, and it will tell you a lot of information about it. Another way to replace the symbol, if you haven’t yet experienced the life- and file-altering wonders of EMACS, is

sed -i 's/\x0/NA/g' data.dat

Now I tried read.csv again and it seemed to work without errors, but I noticed that the number of ‘observations’ R thought were in the file (dim(mydata)) was not the same as the number of lines in the file, so I knew something was wrong. To see the number of lines in a file you can do wc -l data.dat in the terminal.
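The same comparison can also be done without leaving R; this assumes the data frame and file name from above:

nrow(mydata)                   # how many observations R actually read
length(readLines("data.dat"))  # how many lines are in the raw file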

It took me quite some time to figure it out. The following finally worked correctly:

mydata <- read.table("data.dat", na.strings=c("", "NA"), sep="*", comment.char="", quote="")

?read.csv reveals that it actually calls read.table internally and makes some assumptions for you. One of those assumptions is sep=",", but we specified that ourselves. The ones that got me were comment.char and quote. read.csv assumes that comment.char is "", which disables commenting altogether and is good for my data, but read.table sets it to "#". Additionally, read.csv sets quote="\"" by default. Initially, after switching from read.csv to read.table, I started getting errors like this:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 9237 did not have 8 elements

I checked the line it complained about, but it had 8 elements. I know that sometimes errors happen earlier than where the error message indicates, so as a sanity check, I wrote this quick little ditty in Python to check the element count on each line:

#!/usr/bin/env python

badlines = []

# Record the line number of any line that does not split into exactly 8 fields.
with open('data.dat', 'r') as orders:
    for linenum, line in enumerate(orders, start=1):
        fields = line.split('*')
        if len(fields) != 8:
            badlines.append(linenum)

print(badlines)

However, this came back with an empty list, so I knew that there was something else going on. Once I took a closer look at the documentation and set quote="", disabling quote processing altogether, I finally had no errors and the correct number of observations. In hindsight that makes sense: a stray quote character inside a text field makes R treat everything up to the next quote as a single value, swallowing delimiters and line breaks along the way, which throws the row count off.

Also, while in the help page for read.table/read.csv, I found that na.strings is useful for telling R to interpret blank fields as NA. By setting na.strings=c("", "NA"), we're telling R to interpret both "" and "NA" as NA.
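As a quick sanity check on the NA handling, something like the following counts the NA values per column (since no header row was read, the columns get default names like V1 through V8):

colSums(is.na(mydata))  # number of NA values in each column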

There's more data manipulation I may need to do, but for now I can finally start looking at the data.