The Programmer's Probe

An exploration of open source technologies, algorithms and optimization, Scala and Java, and other programming-related curiosities

Do You Care if People Understand You?

The last S in the popular acronym K.I.S.S.  (Keep It Simple, Stupid) needs to be emphasized.  It is directed at someone who is being intellectually lazy, which implies that those who keep it simple are the ones using their gray matter effectively.  The world needs to be more conscious that information is not transmitted magically.  Transmission requires both an acknowledgement of the gap between neurons in distinct minds, and the creativity to reshape a block of knowledge in a way that caters to the listener.

President Obama said “APIs”


Briefly digressing: open-source software is free, publicly available software that powers much of the internet, and anyone is permitted to modify it for their specific purposes.  Closed-source software is the opposite, and benefits only one person or company.  Open source thrives not on profit, but rather on public awareness and a sense of community, because it depends on reciprocation in some form.

I attended a conference on open-source software, OSCON 2013, and I witnessed a tragedy.  The U.S. government sponsors a web site, http://www.whitehouse.gov, which promotes “Open Government”, “Open Data”, and “Open-Source”.  See the mission of this specific department here:  http://www.whitehouse.gov/developers.  At OSCON 2013, Leigh Heyman, Director of New Media Technologies for the Executive Office of the President, showed us a video.

It was a simple interaction between himself and none other than President Barack Obama, in which the President asks Leigh what his department is working on next.  Leigh responded with “APIs”.  The President parroted his words and then, as you might expect from a seasoned officeholder, jocularly conceded he had no idea what an API was, followed by laughter throughout the room.  We at OSCON laughed as well, taking pride in the moment where “a techie like us” got the President to utter a technical acronym.  However, this was nothing less than a lost opportunity, and a potentially tremendous one.  While API, Application Programming Interface, is one of the most common technical terms in modern software engineering, it’s still a technical acronym that no one should expect the listener to know.  Leigh, instead, could have said, “We’re developing an interface for other software to access our data easily over the internet.”

The Director of a publicly-funded, high-exposure technology department should certainly have prepared in advance a simple, canned summary of the future direction of his group’s efforts.  Effective elevator pitch-style explanations are not easy to create, but if you want to be able to consistently explain something on a moment’s notice, you must think it through and practice.  This was a rare opportunity to transfer knowledge of the benefits of the open-source community and data sharing to a figure of such influence.  Instead, it was reduced to a cute moment where “the open-source community affected something high-profile”.  The recent George Zimmerman trial was interrupted when a witness gave testimony remotely via the internet (using Skype).  People sent chat requests to the witness, interrupting something on public television.  That is what Leigh’s interaction with the President equated to, and it could have been so much more.  Speaking directly to the President is not in any way relying on the Butterfly Effect to advance your goals.

The TCP tuning talk


Fast-forward to another much more technical talk at OSCON, on improving the performance of TCP, delivered by Jason Cook of a digital content delivery internet company, Fastly.  TCP is one of the core methods by which computers on the internet can communicate.  Hoping for deeper level insights into the subtleties of TCP, I attended this talk.

Jason started by talking about how TCP works, mentioning how two computers initiate communication by sending “SYNs” and “ACKs” to each other.  I have personally dug deep into TCP before, and I realized he was describing the “handshaking” procedure two computers use to start communicating with each other.  I knew TCP confirms messages are received by using “ACK”nowledgement messages, but I forgot why messages are called SYNs.  Most programmers communicate through TCP using pre-built functionality that alleviates concerns about the inner workings of TCP, but here, the speaker was assuming those attending his talk would be familiar with them, reviewing only some key points.  He was also using acronyms that many in the audience would not understand.  He continued from there, speaking very fast and using equally technical terminology.

This demonstrates a couple of additional problems.  If you only cater to a particular level of expertise, your audience should be appropriately filtered, perhaps by explicit language in the abstract of a talk.  It is true Jason only had limited time to cover the intended subject matter, but if he wanted to educate a wider audience, he should have pared down the information as necessary to allow for more explanation and cogitation within the minds of his audience, which could have included anyone within the open-source community.

Practicing what I preach


Before I personally write or say anything, I run it against what I want to assume the reader or listener knows.  Perhaps I shouldn’t have assumed people know who George Zimmerman is, but enough do, and the point I was making was not dependent on this knowledge.  However, I did not assume you knew what TCP was.  Most programmers today would know what TCP is, but I want non-programmers to understand everything in this article too.  I did not tell you that TCP stands for “Transmission Control Protocol” because I wanted to keep it simple and not distract.  Being able to write or speak in a way that is understandable by most if not all of a target audience, and yet does not stray significantly from your overarching purpose, is not an easy task, but it should be the intention of any individual who endeavors to communicate.

I recently wrote an article on Git, popular open-source software used to store the incremental versions of software code.  At first, I only intended to write an article about how I use it, but then I realized I would like the article to be useful to anyone just getting into programming who may not necessarily be familiar with Git.  I proceeded to add a very lengthy tutorial, with small steps and plenty of screen shots.  It’s true there are many other tutorials out there on Git, but Git is not the easiest tool to learn, and that tells me there’s a need for someone with my unassuming approach to write a tutorial about it.  I likely could have saved a fair bit of time by finding the most unassuming Git tutorial on the internet and referring to it.  This is a reasonable compromise, but it only works if what you refer to is also geared toward your intended audience.

If I wanted to, I could have been much lazier and left the tutorial off altogether, targeting a smaller audience–perhaps those already familiar with Git.  If I did, I would have stated clearly at the beginning whom I’m writing for, and it would have saved me a lot of time.  Anyone not very familiar with Git who ignored that warning would be doing so at their own peril.  If I left that warning off, I would selfishly be imposing on them the unexpected burden of doing significant research and self-learning in order for their investment in reading time to be worthwhile.

Simplifying is not easy


There is a wonderful article geared toward programmers, Solving Embarrassingly Obvious Problems In Erlang.  You don’t have to read that article, however, to get my point.  The author’s purpose is to promote the maintainability of software code, or how understandable the code is to a programmer unfamiliar with it.

A function is a named block of instructions, with inputs and an output.  Most computer programs are constructed of many functions.  A simple example is:

add(a, b) { return a+b }

In this example, the function name is “add”, “a” and “b” are inputs, and the output is the sum.  A programmer can choose to write a very complex function, that does numerous things in order to compute the return value.  The author suggests breaking functions down into tiny functions, each named in a way that fully describes the purpose of the function.  When breaking up a function, he suggests first asking, for each line, “What’s going on here?” and then moving that line to a function with a name answering that question.  This results in the functions providing a form of documentation describing the code.  Understanding what a function does becomes very obvious for any future programmer maintaining the software.
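To make the idea concrete, here is a hypothetical sketch of my own (not code from the Erlang article), written as shell functions so it stays language-neutral.  Each tiny function’s name answers “What’s going on here?” for the line it replaced:

```shell
# Dense version: correct, but a maintainer has to decode the arithmetic.
average_dense() {
    echo $(( ($1 + $2) / 2 ))
}

# Decomposed version: each tiny function's name answers
# "What's going on here?" for the line it replaced.
add() {
    echo $(( $1 + $2 ))
}

halve() {
    echo $(( $1 / 2 ))
}

average() {
    halve "$(add "$1" "$2")"
}

average 4 8    # prints 6, the same answer as average_dense 4 8
```

The two versions behave identically; the difference is that a future maintainer can read the second one aloud.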

A key point in this article is that answering the question “What’s going on here?” is not easy.  It takes some thought to really break down what is being said into tiny, very maintainable components.  This process is necessary, however, because this information must be conveyed somehow.  If you do not cater toward the unfamiliar, then only the familiar will be able to maintain your code.  You must either acknowledge that your code cannot be changed unless you, or others intimately familiar with it, are available, or accept that you are making it much more painful for future developers to work with your code.  They’ll have to logically deduce why you do everything that you do, and they’ll often have to scour the internet and other resources to assist with this.  Some parts of your code may be outright impossible to understand without some explanation.

Aside from that obvious benefit, the “What am I really trying to say?” mentality increases your own understanding.  You must be knowledgeable to be explicit.  This article demonstrates this by showing that when you are explicit about what your code is doing, you are afforded the opportunity to optimize it due to your increased understanding.  Likewise, when catering an explanation to a less knowledgeable target audience, you are not only forced to truly master the subject matter, but you are also able to state it more clearly.

Another great thing about this article is that the author uses the Erlang programming language, but fully understands his audience may be completely unfamiliar with Erlang.  He makes a conscious choice to target programmers of all languages, addressing the difficulty in understanding Erlang syntax early on, and he remains sensitive to that choice throughout.

One small hump can block a mountain of understanding


Erring on the side of assuming unfamiliarity in your target audience is the safer practice, but avoid doing more work than you have to: encourage listeners to volunteer what they already know, and learn explicitly what that is.  Sometimes an audience will generalize about an area of expertise, which can give you a false sense of security while explaining a concept.  Intersperse your explanation with confirmations of familiarity.  Reassure them that there’s nothing wrong with being familiar with only a portion of a topic and not everything.

In a presentation about “Khan Academy”, a web site with thousands of free online lectures, Salman Khan talks about the student analysis tools it provides.  They allow, among other things, a teacher to track the time a student spends on each problem throughout a homework assignment.  What they found was that seemingly slower students, when given assistance with a particular problem they were “stuck on”, would often catch up to or even exceed the other students once they were given the necessary explanation to surmount the obstacle.

This demonstrates the importance of not “skipping” elements of an explanation.  Explanations are naturally constructed in a hierarchy of understanding.  There is a set of fundamentals one must understand in order to master the more advanced intermediate concepts, which in turn are required to comprehend your explanation.  The two sides to this are:


  • Without explaining the hierarchy of prerequisite learning in its entirety, your audience cannot fully understand your explanation.
  • The effort to immediately address points of confusion is worthwhile.

Other suggestions for successful explanation


There are other common mistakes I find impede the successful transference of information:

  • Hearing is not automatic.  Often communication takes place in an environment afflicted with noise pollution from chatting coworkers, street noises, or even an air conditioner.  Auditory capability varies between each and every person.  If the listener fails to hear an important detail, it is up to you to accommodate.  You will have an opportunity to refine your initial statement, but you should not diverge very far.  Otherwise, the listener may not recognize the second utterance as repetition and will be concerned about missing information.
  • Interactivity ensures understanding.  You can better ensure comprehension by querying for opinions and predictions relevant to the subject matter.  Unexpected inquiries should be welcomed, as this tests and possibly expands your own understanding of the topic, whether through logical exploration, or future investigation.
  • Recognize learning is hard.  Never assume transference of information will be trivial.  It is very easy for a person to nod their head, assuming, perhaps rightfully, that they understand “enough”, or to be distracted and simply trying to be polite.  Pride may also be a factor.  With modern education largely consisting of a lecture format, where minds silently attempt to track and accept a continuous inflow of facts and logic unquestioningly, the ability to do so reflects one’s supposed intelligence and ultimately one’s status.  Minds are trained to attempt this style of learning as much as possible, and fear the inability to do so.  This has no place in any realistic approach to the human mind, let alone a simple conversation.  Set the listener at ease in this regard by encouraging questions, pausing to proactively confirm understanding, and perhaps even openly acknowledging complexity.
  • Allow time for critical thinking.  Most minds cannot multitask well, if at all.  Studies such as this one at UCLA suggest multitasking hinders learning.  That means if the listener is questioning, or extrapolating from, anything you say, there is an increased likelihood of reduced comprehension.  Recognize this, and allow time for cogitation.  Be attentive to facial expressions and body language, and to the complexity or controversial nature of what you are saying, then use your intuition to know when to continue.  Again, suggesting likely questions proactively will be of benefit, as often the listener will be inclined not to impose an interruption.
  • Prioritize information.  If there are time restrictions on the transference of information, certainly prioritize what’s important.  Also, however, identify the minimal subset of information that will be valuable to the listener.  Balance the benefits of a cursory understanding of the entire subject matter against a complete understanding of an independently valuable portion of it.
  • Validate sources.  It is sub-optimal and often ineffective to refer people to sources of information that you have not personally examined.
  • Provide reference material.  While verbal communication with Q&A may be the most effective way to provide an introductory explanation of a potentially complex subject, you should never assume complete transmission of knowledge.  The target audience may very well not even recognize incomplete understanding, and should have material to refer to, to fill in the gaps as they present themselves.

Conclusion


Knowledge sharing is at the core of many of our professional and personal encounters.  People possess various degrees of background knowledge, experience, and cognitive ability, and there are many potential impediments to understanding each and every bit of information.  Aside from the attentiveness of the learner, the responsibility for successful transmission of information lies wholly with the one initially in possession of it.  By attempting to explain anything to anyone, you are engaging in a task that has potential benefit, but that may also result in disinterest in your subject matter, discouragement, or at the very least, a waste of time.  This may further evolve into a generalization of the target’s impression of interactions with you, and it is thus of social benefit for the transmissions to be frequently successful.  By consciously expending energy to practice empathy and non-assumption, and by articulating the required background knowledge for your intended audience, you can greatly improve the quality of your interactions with others who have entrusted you with their time, attention, and perhaps curiosity.

Introduction to Git Along With a Sensible Git Workflow


An initial note:  I first learned about Git from the book Pro Git, which, to my amazement, is freely available online, so if you want a more comprehensive overview, check that out.

Also, I’d like to amend this post by saying, for change/commit descriptions, it’s best practice to use the imperative style. (e.g. “Add file A”, rather than “Added file A”). It’s too late for me to update all these screenshots.


What is a Version Control System?


A version control system allows you to store different versions of your code as you make changes.  This allows you to avoid losing your changes, and to access your code from a convenient location on the internet.  You can keep some changes separate in their own branch, and merge those changes into the master branch when you’re ready, so the main version of your code is unaffected until you know your changes function properly.  In case you’re unfamiliar with the term “branch”, picture it as a particular version of your entire project.

Let’s say you’re working on a book, and you have a published version and a “work in progress” version.  You are rewriting a chapter of your book in the “work in progress” version, but you still have the published version separate, so if you want to print out a copy of your book without the partially modified chapter, you can.  When you’re done modifying that chapter, you can merge your changes into the published version, deleting the modification branch afterward.  You could also have a permanent “student edition” branch that has more detail, and perhaps exercises at the end of each chapter.  You can easily merge the same changes into that branch to keep it up to date, while maintaining a separate copy.
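The book analogy maps directly onto git commands.  Here is a sketch you can run in a throwaway folder (assuming git is installed; the file name, branch name, and identity below are made up for illustration):

```shell
set -e
book=$(mktemp -d) && cd "$book"
git init -q
git config user.name "Author"                 # an identity is required for commits
git config user.email "author@example.com"
published=$(git symbolic-ref --short HEAD)    # "master" or "main", depending on your git version

echo "Chapter 1, first edition" > chapter1.txt
git add chapter1.txt
git commit -qm "Publish first edition"

git checkout -qb work-in-progress             # start the "work in progress" version
echo "Chapter 1, rewritten" > chapter1.txt
git commit -qam "Rewrite chapter 1"

git checkout -q "$published"                  # the published version is untouched...
cat chapter1.txt                              # ...still "Chapter 1, first edition"

git merge -q work-in-progress                 # fold the rewrite into the published version
git branch -d work-in-progress                # delete the modification branch afterward
cat chapter1.txt                              # now "Chapter 1, rewritten"
```

Note that creating the “work in progress” branch and switching between versions are nearly instant, no matter how large the “book” is.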

Overview of This Article


I hadn’t been particularly inspired by any version control system until I dove into Git, created by Linus Torvalds, the creator of Linux.  If you’re already familiar with Git, feel free to skip the tutorial sections.  I’m going to go over the following:

  • Summarize what sets Git apart from other alternatives
  • The basics of Git, so someone new to Git can be productive with it
  • How it’s set up at New York Magazine, where I work
  • My preferred workflow using Git

Key Benefits of Git


  • When you create a new branch, Git doesn’t copy all your files over.  A branch will point to the original files and only track the changes (commits) specific to that branch.  This makes it blazingly fast, as opposed to its main competitor in terms of market share, Subversion, which actually laboriously copies the files.
  • Git lets you work on your own copy of a project, merging your commits into the central repository, often on github.com, when you want your commits to be available to others.  Github.com, by the way, will host your project for free, as long as it’s open source.  This means you can reliably access your code from anywhere with an internet connection.  If you lose that internet connection, you can continue to work locally and sync up your changes when you’re able to reconnect.
  • When you screw up, you can usually undo your changes, somehow.  You might need to call in an expert in serious cases, but there’s always hope.  This is the best “key benefit” a version control system can have.
  • Git also lets you keep your commit history very organized.  If you have lots of little changes, it lets you easily rewrite history so you see it as one big change (via something called rebasing).  You can add/remove files in each commit, and certainly change the descriptions of each.  This definitely forces me to consciously repress OCD tendencies.
  • It’s open source, fast, and very flexible, so it’s widely-adopted and therefore well-supported.


Other Notable Features


  • With Git, you can create “hooks”, which let things happen automatically when you work with your code.  A common usage is to create a hook to check the description submitted with each commit, making sure it conforms to a particular format.  Perhaps you have your bugs described in a bug tracking system and each bug has an ID #.  Git can ensure each message has “Bug: SomeNumber”.
  • Another under-appreciated feature is how Git tracks files.  It uses an algorithm called SHA-1 to take the contents of files and produce a large hexadecimal number (hash code).  The same file will always produce the same hash code.  This way, if you move a file to a different folder, it can detect that the file moved, and not think that you deleted one file and added another.  This allows Git to not have to keep two copies of the same file.
  • While Git is not necessarily the most intuitive version control system out there, once you get used to it, you’re able to browse through its internal directories and it all makes complete sense.  Wondering where the file with the hash code “d482acb1302c49af36d5dabe0bccea04546496f7” is?  Check out this file:  “<your project>/.git/objects/d4/82acb1302c49af36d5dabe0bccea04546496f7”.  See how the file’s storage location is determined by its hash code?  There are also lots of lower-level commands that let you build the operations you want, in case, for instance, Git’s merge command doesn’t work how you’d like it to.
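As an aside, you can reproduce git’s file-tracking hash with standard tools: a blob’s ID is the SHA-1 of the header “blob <size in bytes>”, a zero byte, and then the file’s contents.  This is a sketch of the idea, not git’s actual implementation (git also compresses what it stores):

```shell
blob_id() {
    # Mimic git: SHA-1 over "blob <size>\0" followed by the file's contents.
    size=$(wc -c < "$1")
    { printf 'blob %d\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

echo "Tomayto tomahto," > lyric.txt
blob_id lyric.txt                # a 40-digit hexadecimal hash code

# The same contents always produce the same ID, wherever the file lives:
mkdir -p subfolder && cp lyric.txt subfolder/
blob_id subfolder/lyric.txt      # identical to the ID above
```

That stability is exactly what lets git recognize a moved file instead of treating it as a delete plus an add.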

Tutorial


So you’re going to start a new project, in whatever programming language, and you want to use version control?  To demonstrate, I’m going to create a silly, sample application in Scala that’s very easy to understand.  I’m going to assume you’re familiar with your operating system’s command-line interface, and that you’re able to write something in the language of your choice.

Setup


Github is one of the go-to places to get your code hosted for free.  It gives you a home for your code, that you can access from anywhere.  Initial steps:

  1. Go to http://github.com and “Sign up for Github”
  2. You’ll need Git.  Follow this page step by step:  http://help.github.com/articles/set-up-git
  3. This explains how to create a new repository:  https://help.github.com/articles/create-a-repo
  4. Lastly, you’re going to want to get used to viewing files that start with a “.”  These files are hidden by default, so at the command line, when you’re listing contents of a directory, you need to include an “a” option.  That’s “ls -a” in OSX and Linux, and “dir /a” for Windows.  In your folder options, you can turn on “Show hidden files and folders” as well.

Once you’ve gotten that far, there’s nothing stopping you, outside of setting aside some play time, from using everything Git has to offer.
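One step from that setup page is worth repeating here, because commits later in this tutorial will refuse to run without it: git records a name and email with every commit.  The name and address below are placeholders for your own:

```shell
# One-time identity setup; git stamps these onto every commit you make.
git config --global user.name "Your Name"
git config --global user.email "you@example.com"

git config --global --list    # verify what you just set
```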

Clone a Repository


Cloning a repository lets you grab the source code from an existing project that you own, or someone else’s project that you have access to (usually public).  Unless it’s your project, you won’t be able to make changes, so you’re going to “fork” my potayto project, which means to create your own copy of it under your own account, then you can modify that to your heart’s content.  I keep all of my projects locally (on my computer) in a “projects” folder in my home directory, “/Users/sdanzig/projects”, so I’m going to use “projects” for this demo.

First, fork my repository…

I created a sample project on github, as you should now know how to do.

Let’s get this project onto your hard drive, so you can add comments to my source code for me!

First, log into your github account, then go to my repository at https://github.com/sdanzig/potayto  … Click “Fork”:

Fig. 1

Then select your user account to copy it to.  At this point, it’s as though it were your own repository, and you can actually make changes to the code on github.  We’re going to copy the repository onto our local hard drive, so we can both edit and compile the code there.

Fig. 2

Folder structure

There are a few key things to know about what git is doing with your files.  Type:
cd potayto
There are a couple of things to see here.  List the contents of the potayto folder, being careful to show the hidden files and folders:

Fig. 3

The src folder has the source code, and its structure conforms to the Maven standard directory structure.  You’ll also see a .git folder, which contains a complete record of all the changes that were made to the potayto repository, and also a .gitignore text file.  We’re not going to dive into the contents of .git in this tutorial, but it’s easier to understand than you think.  If you’re curious, please refer to the online book I mentioned earlier.


git log

A “commit” is a change recorded in your repository.  Type “git log” (you might have to press your space bar to scroll, and type “q” at the end to quit the display):

Fig. 4

Git’s log shows the potayto project has 3 commits so far, from the oldest on the bottom, the first commit, to the most recent on top.  You see the big hexadecimal numbers preceded by the word “commit”?  Those are the SHA codes I was referring to.  Git also uses these SHA codes to identify commits.  They’re big and scary, but you can just copy and paste them.  Also, you only need to type enough letters and numbers for a commit to be uniquely identified.  Five is usually enough.  For this project, you can get away with 4, the minimum.

Let’s see how my project started.  To see the details of the first commit, type:

git show bfaa
Fig. 5

Initially I checked in my Scala application as something that merely printed out “Tomayto tomahto,” and “Potayto potahto!”  You can see that near the bottom.  The “main” method of the “Potayto” object gets executed, and there are those two “print lines”.  Earlier in the commit you can see the addition of the .gitignore I provided.  I’m making git ignore my Eclipse-specific dot-something files (e.g. .project) and also the target directory, where my source code gets compiled to.  Git’s show command is showing the changes in each file, not the entire files.  Those +’s before each line mean the lines were added.  In this case, they were added because the file was previously nonexistent.  That’s why you see the /dev/null there.

Now type:

git show 963e
Fig. 6

Here you see my informative commit message about what changed, which should be concise but comprehensive, so you’re able to find the change when you need it.  After that, you see that I did exactly what the message says.  I changed the order of the lyrics.  You see two lines beginning with “-”, which are the lines removed, and two lines beginning with “+”, which are the lines added.  You get the idea.

The .gitignore File, and “git status”

View the .gitignore file.
.cache
.settings
.classpath
.project
target
This is a manually created file that tells git what to ignore.  If you don’t want certain files tracked, you list them here.  I use software called Eclipse to write my code, and it creates hidden project files which git will see and want to add to the project.  Why should you be confined to using not only the same software as me to mess with my code, but also the same settings?  Some teams might want to conform to the same development environments, and checking in the project files might be a time saver, but these days there are tools that let you easily generate such project files for popular IDEs.  Therefore, I have git ignore all the Eclipse-specific files, which all happen to start with a “.”

There’s also a “target” folder.  I’ve configured Eclipse to write my compiled code into that folder.  We don’t want git tracking the files generated upon compilation.  Let those grabbing your source code compile it themselves after they make what modifications they wish.  You’re going to want to create a .gitignore file for your own projects.  It gets checked in along with your project, so people who modify your code don’t accidentally check in their generated code as well.  Others might be using IntelliJ, which writes .idea folders and .ipr and .iws files, so they may append those to the .gitignore, which is completely fine.

Let’s try this.  Type:
git status
Fig. 7

You’ll see you’re on the main branch of your project (a version of your code), “master”.  Being “on a branch” means your commits are appended to that branch.  Now create a text file named “deleteme.txt” using whatever editor you want in that potayto folder and type “git status” again:

Fig. 8

Use that same text editor to add “deleteme.txt” as the last line of .gitignore and check this out:

Fig. 9

See, you modified .gitignore, so git doesn’t see your deleteme.txt file.  However, other than this nifty feature, .gitignore is a file just like any other file in your repository, so if you want this “ignoring” saved, you have to commit the change, just like you would commit a change to your code.
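If you’d rather see this without the screenshots, the same experiment fits in a few commands in a throwaway repository (assuming git is installed):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q

touch deleteme.txt
git status --porcelain        # shows "?? deleteme.txt": untracked, not yet ignored

echo "deleteme.txt" > .gitignore
git status --porcelain        # now lists only "?? .gitignore"
```

The --porcelain flag is just a compact, script-friendly form of the “git status” output you see in the screenshots.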


Staging Changes You Want to Commit


Here’s one of the fun things with git.  You can “stage” the modified files that you want to commit.  Other version control systems ominously await your one command before your files are instantly changed in the repository, perhaps the remote repository for the entire team.

Let’s say you wanted to make a change involving files A and B.  You changed file A.  You then remembered something you’d like to do with file Z, on an unrelated note, and modified that before you forgot about it.  Then you completed your initial change, modifying file B.  Git allows you to “add” files A and B to staging, while leaving file Z “unstaged”.  Then you can “commit” only the staged files to your repository.  But you don’t!  You realize you need to make a change to file C as well.  You “add” it.  Now files A, B, and C are staged, and Z is still unstaged.  You commit the staged changes only.

Read that last paragraph repeatedly if you didn’t follow it fully.  It’s important.  See how Git lets you prepare your commit beforehand?  With a version control system such as Subversion, you’d have to remember to make your change to file Z later, and your “commit history” would show that you changed files A and B, then, in another entry, that you changed file C later.
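Here is that A/B/C/Z story replayed as commands, in a throwaway repository (git must be installed; the file names come straight from the example above):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "You"
git config user.email "you@example.com"

printf 'a\n' > fileA; printf 'b\n' > fileB
printf 'c\n' > fileC; printf 'z\n' > fileZ
git add -A && git commit -qm "Initial commit"

echo "related change"   >> fileA
echo "unrelated change" >> fileZ      # the thing you did before you forgot
echo "related change"   >> fileB

git add fileA fileB                   # stage the change you set out to make...
echo "related change"   >> fileC
git add fileC                         # ...plus the one you realized it needs

git commit -qm "Make one logical change to A, B and C"
git status --porcelain                # fileZ: still modified, still unstaged
```

The commit contains exactly files A, B, and C, while Z waits quietly for its own commit later.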

We won’t be as intricate.  Let’s just stage our one file for now.  Look at Figure 9.  Git gives you instructions for what you can do while in the repository’s current state.  Git is not known for having intuitive commands, but it is known for helping you out.  “git checkout -- .gitignore” to undo your change?  It’s strange, but at least it tells you exactly what to do.

To promote .gitignore to “staged” status, type

git add .gitignore

Fig. 10

The important thing to note here is that your file change is now listed under “Changes to be committed”, and git is spoon-feeding you exactly what you’d need to type if you wanted to undo this staging.  Don’t type this:

git reset HEAD .gitignore
You should strive to understand what’s going on there (check out the Pro Git book I linked to for those details), but for now, in this situation, you’re simply given a means to an end for when you might need it (changing your mind about what to stage).

By the way, it’s often more convenient to just type “git add <folder name>” to add all modifications of files in a folder (and subfolders of that folder).  Also very common is “git add .”, a shortcut to stage all the modified files in your repository.  This is fine as long as you’re sure you’re not accidentally adding a file, such as Z, that you don’t want grouped into this change in your commit history.

It’s also useful to know how to stage the deletion of a file.  Use “git rm <file>” for that.


Committing Changes to Your Repository


Guess what?  We get to do our first commit!  Time to make that .gitignore change official.  Type:

git commit -m “Added deleteme.txt to .gitignore”
Fig. 11

You could just type "git commit", but then git would load up a text editor and require you to type a commit message anyway.  In OSX and Linux, "vim" would load up; in Windows, you’d get an error.  To configure a usable editor in Windows, type:

git config --global core.editor "notepad"
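Here’s a hedged sketch of that configuration.  To avoid touching your real ~/.gitconfig, it redirects HOME to a scratch directory first, and "nano" is just a stand-in for whatever editor you prefer.

```shell
# Configure the editor git launches for commit messages.
# HOME is redirected so the demo's --global setting stays sandboxed.
export HOME=$(mktemp -d)
git config --global core.editor "nano"   # "notepad" on Windows, or a /full/path
git config --global core.editor          # prints: nano
```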

Mug available on thinkgeek.com
sporting some vi quick reference.
Vim supports all vi commands listed.
If you end up in vim and are unfamiliar with it, realize it’s a very geeky and unintuitive but powerful editor to use.  In general, pressing the escape key, and typing “:x” will save what you’re writing and then exit.  The same syntax will work to choose a new full screen editor in OSX and Linux, of course replacing notepad with the /full/path/and/filename of a different editor.

The full screen editor is necessary if you want a commit message with multiple lines, or in other situations, so if you hate vim, configure git to one you do like.

Enough with this babble.  Fill that VI mug with champagne – you just made your first commit!  If you can contain your excitement, type:
git log
Fig. 12
The change on top is yours.  Oh, what the heck, let’s take a look at it:

Fig. 13

See the +deleteme.txt there?  That was you!  In a “diff” like this, git tries to show up to three lines of context before and after each of your changes.  Here, there were no lines below your addition.  The -3,3 and +3,4 are ranges: - precedes the old version’s range, and + the new version’s.  The first number in each range is the starting line number, and the second is how many lines the displayed hunk spans in that version.  The hunk spans 4 lines now, but only spanned 3 before your change.
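You can reproduce a hunk header like that in a scratch repository; the file name and contents here are invented:

```shell
# Commit a three-line file, append a fourth line, and inspect the ranges
# in git diff's hunk header.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"
printf 'line one\nline two\nline three\n' > notes.txt
git add notes.txt && git commit -qm "three lines"
echo 'line four' >> notes.txt
git diff | grep '^@@'
# @@ -1,3 +1,4 @@  -> the old file's hunk spans 3 lines, the new one's 4
```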

If you want to revert changes you made, the safest way is to use “git revert”, which automatically creates a new commit that undoes the changes in another commit.  Don’t do this, but if you wanted to undo that “deleteme.txt ignoring” commit which has the SHA starting with 0c22, you can type: “git revert 0c22”
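Here’s what a revert looks like end to end in a scratch repository.  The SHA will differ every run, so this sketch uses HEAD instead of a hard-coded prefix like 0c22:

```shell
# git revert undoes a commit by adding a new commit; history is preserved.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"
git commit -q --allow-empty -m "initial commit"
echo scratch > deleteme.txt
git add deleteme.txt && git commit -qm "Added deleteme.txt"
git revert --no-edit HEAD   # a new commit that removes deleteme.txt again
git log --format=%s         # revert, add, initial: nothing was erased
```

Notice all three commits survive; revert records the undo rather than rewriting the past, which is why it’s the safe option.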

The Origin


You cloned your repository from your github account.  Unless something went horribly wrong, this should be:

https://github.com/<your github username>/potayto.git
Git automatically labels where you cloned a repository from as “origin”.  Remember when I said the internals of a git repository were easily accessible in that .git folder in your project?  Look at the text file .git/config:

Fig. 14

It’s as simple as this.
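You can watch that entry appear yourself.  The URL below is a placeholder, since no network access is needed just to record a remote:

```shell
# "origin" is just a name-to-URL mapping stored in .git/config.
repo=$(mktemp -d) && cd "$repo"
git init -q
git remote add origin https://github.com/example/potayto.git
git config --get remote.origin.url            # read it back through git...
grep -A 2 '\[remote "origin"\]' .git/config   # ...or straight from the file
```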


Branches


Before I explain how to make your changes on the version of your code stored on github, I should first explain more about branches.  I already explained how a branch is a separate version of your code.  A change made to one branch does not affect the version of your repository represented by the other branch, unless you explicitly merge the change into it.  By default, git will put your code on a “master” branch.  When you clone a project from a remote repository (remote in this case means hosted by github), it will automatically create a local branch that “tracks” a remote branch.  Tracking a branch means that git makes it easy to:

  • See the differences between commits made to the tracking branch (the local one) and the tracked branch (remote)
  • Add your new local commits to the remote branch
  • Put the new remote commits on your local branch

If you didn’t have your local branch track the remote branch, you could still move changes from one to another, but it becomes more of a manual process.  Hey, guess what?  I can easily demonstrate all this in action!  First, type:
git status
Fig. 15

That deleteme.txt ignoring change you made in your local master branch is not yet on Github!  You have one commit that Github’s (the origin) remote master branch (denoted as origin/master) does not yet have.

Don’t do this now, but if you don’t want to make changes directly in your local master branch, you can create a new local branch, perhaps named “testing” by typing “git branch testing”.  Then you can switch to that branch by typing “git checkout testing”.  Then make whatever changes you want, stage and commit them, then switch back to the master branch with “git checkout master”.  You could also create and switch to a new local branch in one command, “git checkout -b testing”.
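Sandboxed, those branch commands look like this (the "testing" names are the same hypothetical ones from above):

```shell
# Create a branch, switch to it, switch back, then create-and-switch in one go.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"
git commit -q --allow-empty -m "initial commit"
start=$(git symbolic-ref --short HEAD)   # your default branch's name
git branch testing                       # create the branch...
git checkout -q testing                  # ...then switch to it
git symbolic-ref --short HEAD            # prints: testing
git checkout -q "$start"                 # back to where you were
git checkout -q -b testing2              # create and switch in one command
```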

Pushing to the Remote Repository


Let’s put your change on Github, then we’ll make a change directly on Github and pull it.  Git’s push command, if you don’t provide arguments, will just push all the changes in your local branches to the remote branches they track.  This can be dangerous if you have commits in another local branch that you’re not quite ready to push out.  (I once accidentally erased the last week of changes in New York Magazine’s main repository doing this.  We did manage to recover them, but, don’t ask.)  It’s better to be explicit.  Type:

git push origin master
Fig. 16

You don’t really need to concern yourself with the details of how Git does the upload.  As for the command you just typed: git push lets you specify the “remote” you’re pushing to, and the branch.  By naming the branch, you tell git to take that particular branch (“master”, in this case) and update the remote branch of the same name on the origin (your Github potayto repository), creating a new remote “master” branch if it doesn’t exist.  If you didn’t specify “master”, git would try to push the changes in all your branches to branches of the same names on the origin, where those exist; it won’t create new remote branches in that case.
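Here’s the push in miniature.  A local bare repository plays the part of your Github “origin”, but the command is the same explicit "git push origin <branch>":

```shell
# Push an explicit branch to an explicit remote.
remote=$(mktemp -d)
git init -q --bare "$remote"        # stand-in for the Github repository
work=$(mktemp -d) && cd "$work"
git init -q
git config user.email you@example.com
git config user.name "You"
echo hello > hello.txt
git add hello.txt && git commit -qm "first commit"
branch=$(git symbolic-ref --short HEAD)
git remote add origin "$remote"
git push -q origin "$branch"        # explicit remote, explicit branch
git ls-remote origin                # the branch now exists on the remote
```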

Anyway, if you type “git status” again, you’ll see your branch now matches the remote repository’s copy of it.  I’d show you, but I can only do so many screen captures, okay?  You can also type:
git log origin/master
Fig. 17

This is the syntax to see a log of the commits in the master branch on your “origin” remote.  You can see the change is there.  You can also see this list of commits by logging into Github, viewing your Potayto repository, and clicking on this link:
Fig. 18

Pulling Changes from the Remote Repository


While we’re browsing the Github interface, let’s use it to create a change that you can fetch (or pull).  This will emulate someone else accessing the remote repository and making a change.  If you want your local copy of the repository to reflect what’s stored in the remote repository, you need to keep yours up to date by intermittently fetching new changes.  First, let’s create a README.md file which Github will automatically use to describe your project.  Github provides a button labeled “Add a README” for this, but let’s do it the more generic way.  Click the encircled “Add a file” button:
Fig. 19
Now type “README.md” for the name and a description that makes sense to you.

The “md” in the filename stands for “Markdown”, a “markup language” that lets you add formatting to plain text, much as HTML does.  If you want to learn how pretty you can make your README file, you can learn more about Markdown here; just realize Github uses a slightly modified version of Markdown.

Click the “Commit New File” button:
Fig. 20
You’ll see your project described as desired.  Go back to your terminal window and type:
git status
Fig. 21

Wait a sec… Why’s it saying that your local branch is up to date?  It’s because the git “status” command does not do any network communication.  Even typing “git log origin/master” won’t show the change.  Only Git’s “push”, “pull”, and “fetch” talk over the network (as did your initial “clone”).  Let’s talk about “fetch”, since “pull” is essentially a “fetch” followed by a merge.

When you track a remote branch, you do get a copy of that remote branch in your local repository.  However, aside from those three aforementioned commands that talk over the network, git treats these remote branches just like any other branches.  You can even have one local branch track another local branch.  (Probably won’t need to do that.)

So, how do we update our local copies of the remote branches?  “git fetch” will update all the local copies of the remote branches listed in your .git/config file.  Here, I’ll start adding more shadows to my screenshots, in case you actually aren’t as excited about all this niftiness as I am.  Please type:
git fetch
Fig. 22
Fig. 23
Now, you’ll notice there’s still no difference if you type “git log”, but let’s type:
git log origin/master
Fig. 24
Now you see the remote change.  Type:
git status
See, this is more like it, but what does “fast-forwarded” mean?  Fast-forwarding is the simplest form of “merging”, with no potential for conflict.  It happens when the branch you’re merging in already contains every commit in your current branch, plus some new ones; git can just move your branch pointer forward to match, without creating a merge commit.  I’ll explain more later, in the section on “rebasing”, but for now, we’re going to pull these changes in.  Type:
git merge origin/master
Fig. 25
This tells you one file was added, containing one inserted line.  Now if you typed “git log”, you’d see that you brought the change first from the master branch on your Github repository to your origin/master branch, and then from there to your local master branch.  You could even have absolute proof of the change by looking in your current directory:
Fig. 26
See the README.md file?  Of course, there is a shortcut.  It’s too late now, but you could have done everything in one fell swoop by typing:
git pull origin master
That would have not only fetched the commits from the remote repository, but also done the merge.  And if you’re happy with the defaults (fetching from the remote listed in your .git/config file, and merging into your current branch from the branch it tracks), you can simply type:
git pull
You can be as trigger happy as you want with that for now, but once you start dealing with more than one branch, make sure you know exactly which branch the pull will update.
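To see fetch and merge in action without touching your Github account, you can simulate the “someone else” with two clones of a local bare repository.  Every name below is invented for the demonstration:

```shell
# One clone ("colleague") pushes a change; the other ("you") fetches and merges.
ws=$(mktemp -d) && cd "$ws"
git init -q --bare remote.git
git clone -q remote.git colleague 2>/dev/null   # empty-repo clone warns; ignore
git -C colleague config user.email c@example.com
git -C colleague config user.name "Colleague"
git -C colleague commit -q --allow-empty -m "base"
git -C colleague push -q origin HEAD

git clone -q remote.git you                     # your local repository
branch=$(git -C you symbolic-ref --short HEAD)

echo "A test repository" > colleague/README.md  # "someone else" makes a change
git -C colleague add README.md
git -C colleague commit -qm "Create README.md"
git -C colleague push -q origin HEAD

git -C you fetch -q                    # updates your copy of origin/<branch>
git -C you merge -q "origin/$branch"   # fast-forward; README.md appears
ls you
```

A plain "git -C you pull" would have collapsed those last two commands into one.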


Merges and Conflicts


Now for the purposes of learning about merges, we’re going to undo that last merge.  Very carefully, type:
git reset --hard HEAD~1
Fig. 27
The “HEAD~1” means “the 1st commit before the latest commit”, with the latest commit referred to as the “HEAD” of the branch (currently master).  By resetting “hard”, you’re actually permanently erasing the last commit from your local master branch.  As far as Git’s concerned, the last link in the master branch’s “chain” now is the commit that was previously second to last.  Don’t get in the habit of this.  It’s just for the purpose of this tutorial.
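Here’s sandboxed proof that the commit really is gone afterward (again: never do this to commits you’ve already pushed):

```shell
# git reset --hard HEAD~1 erases the last commit and its working-tree changes.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"
git commit -q --allow-empty -m "keep me"
echo readme > README.md
git add README.md && git commit -qm "doomed commit"
git reset -q --hard HEAD~1
git log --format=%s   # prints only: keep me
ls                    # README.md is gone from the working tree as well
```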

Don’t worry – we don’t have to mess with remote repositories for a while.  Your new README.md file is also safely committed to your local repository’s cached version of the remote master branch, “origin/master”.  You could type “git merge origin/master” to re-merge your changes, but don’t do it right now.

Let’s say someone else added that README.md, and you were unaware.  You start to create a README.md in  your local repository, with the intention of pushing it to the remote repository later.   Because we undid our change, there is no longer a README.md file in your current directory.
Normally you’d use a text editor, but for now, type this to create a new README.md file:
echo A test repository for learning git > README.md
Fig. 28
I used the cat command (For Windows, it’d be “type”) to display the contents of the simple file we created.  Let’s stage and commit the thing.  Type:
git add README.md
then type:
git commit -m “Created a simple readme file”
and finally:
git status

Fig. 29
Now we have two versions of a README.md file committed.  You can see that your origin/master branch is one commit in one direction, and your master branch is one commit in the other direction.  What will happen when I try to update master from origin/master?  Let’s see!  Type:
git merge origin/master
Fig. 30
Just as you might think, git is flummoxed.  This is essentially Git saying “You fix it.”  Let’s see what state we’re in.  Type:
git status
Fig. 31

Can’t be any clearer, except for one detail.  Git is telling us to type “git add/rm <file>” to “mark resolution”.  That means, in order to fix this, you could take one of two routes.  DON’T DO THIS! … You could go into README.md, fix it up, then stage it with git add.  Edit the README.md file.  I’ll use vim, but you use whatever editor you want:
Fig. 32
You can see that the two versions are marked very clearly.  HEAD represents the current local branch you’re on, which is master.  If you review all the times you’ve typed “git status”, it’s told you that you’re on branch “master”.  And we know “origin/master” is our local copy of the remote repository’s master branch.  I’m going to remove the scary divider lines (e.g. <<<<, ====, >>>>) and replace those two versions of the project description with a new one:
Fig. 33
If you ignored my warning and you’re doing this, don’t save!  Just exit out!  But if you were doing this, you could save and exit, then “git add” the file, then “git commit”, to stage and commit.  It’s actually better in some ways, because you’re able to rethink each change, and perhaps reword something like I was about to do for this README file.
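If you’re curious what that manual route looks like without endangering your potayto repository, here’s a sandboxed sketch.  A local branch called "theirs" stands in for origin/master, and both readme texts are invented:

```shell
# Provoke a conflict, resolve it by hand, then stage and commit.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"
git commit -q --allow-empty -m "initial commit"
main=$(git symbolic-ref --short HEAD)

git checkout -q -b theirs
echo "A test repository" > README.md
git add README.md && git commit -qm "their readme"

git checkout -q "$main"
echo "A repository for learning git" > README.md
git add README.md && git commit -qm "my readme"

git merge theirs || echo "conflict, as expected"
grep -c '<<<<<<<' README.md                           # the markers are in the file
echo "A test repository for learning git" > README.md # the hand-merged text
git add README.md                                     # mark resolution...
git commit -qm "Merged the two readme versions"       # ...and commit the merge
```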

However, the reason I told you not to do this is because it’s the hard way, especially for complicated conflicts.  Instead, while still in your project directory, having just experienced a failed merge command, type:
git mergetool
Fig. 34

Mergetool will guide you through each conflicted file, letting you choose which version of each conflicted line you’d like to use for the committed file.  You can see, by default, it will use “opendiff”.  Press enter to see what “opendiff” looks like:

Fig. 35

If this were more than one line, you’d be able to say “use the left version for this conflict line”.  Or “use the right version for this line”.  Or “I don’t want to use either line.”  In this case, we only have one conflicted line to choose from, so make it count!  The one conflicted line is selected.  Click on the “Actions” pull down menu and choose “Choose right”.  You’ll see nothing changed.  That was because that arrow in the middle was already pointing to the right.  Try selecting “Choose left”, then “Choose right” again.  You’ll see what I mean.  Opendiff doesn’t give you the opportunity to put in your own custom line.  You can do that later if you wish.

At the pull down menu at the top of the screen, select “File” then “Save Merge”:

Fig. 36

Go back to the menu and select “Quit FileMerge”.  Now again type:
git status



Let’s stage the new version of the readme file.  Type:
git add README.md
Fig. 37
All set to commit changes, just like if you manually modified and staged (with “git add”) the files yourself.  Now type:
git commit -m “Merged remote version of readme with local version.”
and then:
git status
Fig. 38
Before we go on: if you noticed, there’s a lingering “README.md.orig” file.  That’s just a backup in case the merged file you came up with looks horrible.  It’s a pain to deal with these “orig” files, though.  For now, you can move the file somewhere or just delete it, but check out this page for strategies to deal with those files.

Back to the merge.  Look!  Your branch is “ahead” of “origin/master” by 2 commits.  Let’s see what those commits are.  To show just the last two commits, type:
git log -n 2
Fig. 39
The earlier commit on the bottom is the one you did before, to create your local version of the readme file.  The top commit is the “merge commit” that Git uses to identify where two branches were merged.  Now review what state “origin/master” is in with “git log origin/master”.  We want to get our merged version of the readme onto Github.  Yes, we’re back on the internet!  Let’s push our changes to origin/master and see what happens.  Type:
git push origin master
Fig. 40
Now, just to be sure, we’re not going to look at the “local version” of the remote branch.  Let’s go right to Github to see what happened.  View the commits in your repository:
Fig. 41
What might not make sense here is that you see first the Github-side readme commit, then your local readme commit, then the merge.  These couldn’t all have happened in sequence, since the first two conflict.  What happens is that your local readme commit is logged as a commit on a separate branch that is then merged in.  Let’s see that graphically by clicking on the “Network” button on the right (circled in red).

Fig. 42
Each dot in this diagram represents a commit.  Later commits are on the right.  The one that looks like it was committed to a separate branch (your local master branch) and then merged in is the commit of your local version of the readme file.  Hover over this dot and see for yourself.

It’s good to pull in remote changes fairly often, to minimize the complexity of conflicts.


Rebasing


This is as advanced as this tutorial is going to get, and you’re in the homestretch!  Rebasing is meant to give you that clean, fresh feeling when committing your changes.  With it, you can shape your commits how you prefer before merging them to another branch.  But wait, you might think… You can already do that when you’re staging your files.  You can stage and unstage files repeatedly, getting a commit exactly how you want.  There are two main things that rebasing lets you do in addition to that.

Let’s say you were working on branch A and you created branch B.  Branch B is nothing more than a series of changes made to a specific version of branch A (starting with a specific commit in branch A).  Let’s say you were able to take those changes and reapply them to the last commit in branch A.  It’s as though you checked out branch A and you made the same changes.  Read this paragraph as many times as you need to before you move on.

Remember when I mentioned fast-forward merges?  When you viewed the commit history on Github, did you like seeing commits on other branches being merged in?  Or would you have preferred one commit after another?  Most prefer the latter.  Merging can get quite messy in the worst case, and even when it doesn’t, merge commits clutter the history.  You can use rebasing to keep your merges “fast-forward”, so when you merge your changes into another branch, there’s no “merge commit”.  Your changes are simply added as the next commits on the target branch, and your last change becomes that branch’s new latest commit.

Let’s demonstrate before I talk about the next benefit.  I explained how to create and switch to local branches at the end of the “Branches” section.  Type:
git branch testing
We’re still in the master branch.  Now let’s make another change to that awful readme file.  Load up your editor and add the line “Inspired by the Gershwin brothers”, then save:
Fig. 43
If you type “git status”, you’ll see the only modification is to the readme file.  Here’s a shortcut I didn’t tell you about: to stage and commit all modified files at once, as long as every modified file has been staged at least once before (none of them are “untracked”), use git commit’s “-a” flag:
git commit -am “Added something to the readme file”
then view the log with:
git log -n 2
Fig. 44
There’s our change, right after our merge commit.  We’re not going to make the mistake of adding any more messy merge commits.  Type:
git checkout testing
and then view the README.md file:

Fig. 45
You see that your modification is no longer there.  I’d have you modify the readme file again, but I think I’m done explaining conflict resolution.  If you did modify readme, and then you wanted to reapply your changes over the latest version of the master branch, you’d have another bloody conflict to resolve.  Let’s just create a change in our source code.

Edit the file “src/main/scala/scottdanzig/potayto/Potayto.scala” and add the printing of “Ding!” as shown.  Please, just humor me…

Fig. 46
Now stage and commit:
git commit -am “Added the printing of Ding”
then show the last two changes for both the current “testing” branch and the “master” branch with:
git log -n 2 <branch>
Fig. 47
There be a storm a-brewin’!  Hang in there!  If you merged the testing branch into master now, you’d again see your change added to the master branch, followed by a merge commit.  Wouldn’t it be simpler if we could recreate testing from the current version of master, then automatically reapply your change for you?  Then you’d only be adding your “Added the printing of Ding” commit.  You can do just that right now.  Type:
git rebase master
Fig. 48
Git talks of “reapplying commits” as “replaying work”.  How does it know which commits in your current branch to reapply/replay?  It traverses down the branch, starting with the most recent commit, and finds the first commit that is in the master branch.  Now let’s see the log:
Fig. 49
See?  It’s exactly what I described.  It’s as though you waited for that last change to master to be made before branching.  Now see how easy it is to merge in your changes by switching to the master branch and doing the merge:
Fig. 50
A fast-forward merge is so easy.
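The whole rebase-then-fast-forward dance can be replayed in a throwaway repository; “feature” here plays the role of your testing branch, and the files are invented:

```shell
# Rebase a feature branch so merging it back is a fast-forward.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"
git commit -q --allow-empty -m "base"
main=$(git symbolic-ref --short HEAD)

git checkout -q -b feature
echo ding > Ding.txt
git add Ding.txt && git commit -qm "feature work"

git checkout -q "$main"
echo note > note.txt
git add note.txt && git commit -qm "mainline moved on"

git checkout -q feature
git rebase -q "$main"         # replay "feature work" on top of the new tip
git checkout -q "$main"
git merge --ff-only feature   # succeeds with no merge commit
git log --format=%s           # feature work / mainline moved on / base
```

The --ff-only flag makes git refuse to create a merge commit, which proves the rebase did its job.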

I mentioned there are two things rebasing lets you do that you can’t do with staging alone.  There’s this notion of “interactive rebasing” that I think is the coolest part of git.  This is the last part of the tutorial where you have to do anything, so this is the homestretch of the homestretch.  We’re going back to our testing branch (currently the same as master) to create two new files, A and B.  I’m going to keep this simple.  Type:
git checkout testing
then:
echo test > A
and stage and commit that change.  File “A” is new/untracked, so you can’t use the “-am” shortcut:
git add A
and then:
git commit -m “Added A” 
Fig. 51
Now create another file, B:
echo test > B
and stage then commit as well:
git add B
git commit -m “Added B”
Fig. 52
You’ll see both of those commits in the log:
Fig. 53
Okay, we’re all set to show off interactive rebasing.  We’re going to combine those two commits you just did into one commit.  You have two options:

  • You can do this in the same branch, if you just want to reorganize a branch while you’re working with it.
  • You can also combine commits when you’re rebasing (reapplying/replaying) them onto another branch.

If you don’t think this is the bee’s knees, you’re nuts.  We’re going to rebase the second way, onto master.  The latest change on master is already contained in the testing branch, so rebasing just to avoid merge commits would be unnecessary; merging testing into master would already be a fast-forward merge.  However, we’re also going to use this opportunity to combine the two commits.  Rebasing can be multi-purpose that way.  Type:
git rebase -i master
Fig. 54
Git might scare you with a vim editor window like this.  You see those two “pick” lines at the top?  This is a list of the commits that are about to be reapplied, with the oldest change on top.  If you change an instance of the word “pick” to “squash”, the commit listed on that line will get combined/melded into the older commit above it.  The oldest commit you want to reapply must stay a “pick”.  You can use “p” and “s” instead of “pick” and “squash”, by the way.  If you want, you can even remove some commits from this list altogether, but be careful: that effectively removes all record of those commits from the current branch.  Oh look!  It even warns you in ominous CAPITAL LETTERS.

Let’s change the second “pick” to a “squash”.  It’s possible to change your default editor from “vim” if you want, but if you prefer vim like me or just haven’t gotten around to it yet, just heed my instructions:

  1. Use the arrow keys to move the cursor to the “p” of the second “pick”.
  2. Type “cw” to change the word.
  3. Type “s” then press the escape key.
  4. Type “:x” to save and exit.

Now you should see a screen allowing you to create the new commit message:
Fig. 55
This gives you the opportunity to write the new description, perhaps multi-lined for the combined commit.  By default, Git will just put all the combined commit messages one after the other.  If you want, you can accept that and just type “:x” to exit and save.  Or, you can use vim to modify the file to your liking.  If you want to give it a shot, just press “i” to go into insert mode, then use the arrows to move around and backspace to delete.  When you’re done, press the escape key then type “:x”.  Here’s my modified file:
Fig. 56
I could have ignored the lines starting with #, but I got rid of some of them for clarity.  Here’s what the log looks like after I saved and exited:
Fig. 57
See the one big commit?  That “printing of Ding” commit afterward is the latest commit currently in the master branch, so merging the testing branch into master would be a fast-forward merge.  I’d demonstrate that, but I’d rather avoid redundancy and finish the tutorial.
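The squash can even be scripted, which is handy for seeing the mechanics without vim.  GIT_SEQUENCE_EDITOR replaces the editor that would show you the pick list (the sed invocation assumes GNU sed), GIT_EDITOR=true accepts the combined commit message as-is, and this sketch rebases onto HEAD~2 rather than a master branch:

```shell
# Combine two commits into one with a scripted interactive rebase.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"
git commit -q --allow-empty -m "base"
echo test > A && git add A && git commit -qm "Added A"
echo test > B && git add B && git commit -qm "Added B"
# Turn line 2 of the pick list ("pick ... Added B") into a squash:
GIT_SEQUENCE_EDITOR='sed -i -e "2s/^pick/squash/"' GIT_EDITOR=true \
  git rebase -i HEAD~2
git log --format=%s       # the two "Added" commits became one
git log -1 --format=%B    # the combined message keeps both descriptions
```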


Pull Requests


Commits are often grouped into “feature branches”, each representing all the changes needed for one feature.  Projects with designated maintainer(s) often operate as follows:

  • You push your “feature branch” to a remote repository, often your fork of the main repository.
  • You create a “pull request” on Github for that branch, which tells the project maintainer that you want your branch merged into the master branch.
  • If the branch is recent enough that it was spawned from the most recent commit on the project’s master branch, or it can be rebased onto master without any conflicts, the maintainer can easily merge in your changes.
  • If there are conflicts, then it’s up to the maintainer to do the merge, or to reject the pull request and let you rebase and deconflict the commits in your branch yourself.


New York Magazine Development Environment


At New York Magazine, where I work, we generally have 4 main branches of each project, named dev, qa, stg, and prod.  We have software called Jenkins that monitors each branch; when any change is made, the project is redeployed to a server dedicated to that environment.
  • dev branch - While developers first test their code on their own computers, eventually they need to test changes on a server with shared resources.  This often exposes a bunch of integration issues, so a change frequently requires multiple commits (multiple attempts to get it right) before it’s complete.  It’s a necessary evil that developers simultaneously make changes in this environment for their own features; hopefully, someone else’s changes don’t affect testing of your own.
  • qa branch - This branch is for QA (quality assurance) testing of a new change.  The branch is cleaner, containing only completed changes, and, although everything isn’t necessarily optimized (maybe debugging information is still being recorded to the log, for instance), it’s much more controlled than dev.
  • stg branch - Changes approved by QA go to the “staging” environment.  This environment is meant to be fully optimized, as if it were the production environment.  More issues could be exposed by testing in a fully optimized environment, but usually not.  This is not to be confused with the much lower-level staging in git, but ultimately the concept is the same: you’re preparing a set of features slated to go public, rather than a bunch of file changes about to be committed.
  • prod branch - What your clients/customers/users ultimately see is deployed directly from this branch.
To manage the environment-specific configuration, including enabling optimizations and altering logging levels, we use Puppet.  We also use Git to maintain our internal documentation, written as text files using the Git-variety of Markdown, to allow ease of collaboration and code-friendly formatting.  Hosting a project on Github is free unless it’s to be private.  Most New York Magazine repositories are private.

Each commit message at New York Magazine, optimally, should have a “story number”.  A “story” is a description of a desired modification.  If something should be changed in code, someone describes how the change works in a web interface provided by a story-tracking application such as Atlassian’s JIRA, which we use.  A developer can modify the “status” of the story to reflect progress being made toward its resolution.

We use something called Crucible for “peer code reviews”.  This lets a developer send a series of commits out to fellow developers to have a look at.  It tracks who has had a chance to review your code, and gives them the opportunity to make comments.



My Preferred Workflow with Git


I’m typically tasked with a modification I must make to a shared project hosted as a Github repository as I described.  On Github, I have a separate user, “scottdanzig” for my job-related Github activity, which allows clear separation of my personal projects from what I’ve done that is New York Magazine property.  For my examples, I’ll refer to a web application created with Scala and the Play Framework, that provides restaurant listings for your mobile device.  Let’s say we realized that the listings load very fast, and we can afford to display larger pictures.  Here is my preferred workflow:


Changing the Code



  • First thing I do is change the status of the JIRA story I’m going to work on to “In Progress”.
  • If I don’t yet have the project cloned onto my machine, I’ll do that first: git clone https://github.com/nymag/listings.git
  • I checkout the dev branch: git checkout dev
  • I update my dev branch with the latest from the remote repository: git pull origin dev
  • I create and checkout a branch off dev: git checkout -b larger-pics
  • I make my modifications and test as much as I can, staging and committing my changes after successfully testing each piece of the new functionality.
  • I’ll then update my dev branch again, so when I merge back, hopefully it’s a fast-forward merge: git pull origin dev
  • I’ll interactively rebase my larger-pics branch onto my dev branch.  This gives me an opportunity to change all my commits to one big commit, to be applied to the latest commit on the dev branch: git rebase -i dev then I change all “picks” but the top one to a squash.  I write one comprehensive commit message detailing my changes so far, making sure to start with the JIRA story number so people can review the motivation behind the change.  It’s possible I might want to not combine all my commits yet.  If I’m not sure if one of the incremental changes is necessary, I may decide to keep it as a separate commit.  This is possible if you leave it as a separate “pick” during the interactive rebasing.  Git will give you an opportunity to rewrite the commit description for that commit separately.
  • I checkout the dev branch: git checkout dev
  • Then I merge in my one commit: git merge larger-pics
  • Then I push it to Github: git push origin dev
  • If it complains about it not being a fast-forward merge and rejects my change, I may need to rebase my dev branch onto origin/dev and then try again.  We’re not going to combine any commits, so it doesn’t need to be interactive:  git rebase origin/dev then again: git push origin dev
  • Jenkins will detect the commit and kick off a new build.  I can log into the Jenkins web interface and watch the progress of the build.  It’s possible the build will fail, and other developers will grumble at me until I fix the now broken dev environment.  Let’s say I did just that.
  • If I think it might be a while before I’m able to fix my change, I’ll use “git revert <SHA code>” to undo the commit, then quickly push that to dev.  Either way, I’ll again checkout my larger-pics branch, git rebase dev, make my changes, git pull origin dev, git rebase dev, git checkout dev, git merge larger-pics, git push origin dev.  Let’s say Jenkins gives me the thumbs up now.
  • Next stage is the code review.  I’ll log into Crucible and advertise my list of commits in the dev branch for others to review.  I can make modifications based on their feedback if necessary.



Submitting to QA


Let’s say both Jenkins and my fellow developers are happy.  It’s time to submit my code to QA.  The QA branch is automatically deployed by Jenkins to the QA servers, a pristine environment meant to better reflect what actually is accessed by New York Magazine’s readers.  We have some dedicated QA experts to systematically test my functionality to make sure I didn’t unintentionally break something.  If there are no QA experts available, QA might be done by another developer if the feature is sufficiently urgent.
  • I need to update my local QA branch so I can rebase my changes onto it, pushing fast-forward commits.  I first type: git pull origin qa
  • Then I change to my larger-pics branch: git checkout larger-pics
  • It’s time to rebase my commits onto the qa branch, rather than dev, which can be polluted by the works in progress of other developers.  I type: git rebase -i qa, creating a combined commit message describing my entire set of changes.  I now have a branch that is the same as QA, plus one fast-forwardable commit that reflects all of my changes.
  • I add my branch to the remote repository: git push -u origin larger-pics
  • I go to the repository on Github and create a pull request, requesting my larger-pics branch be merged into the qa branch.



The Project Maintainer


At this point, it’s out of my hands, for the time being.  However, the project has a “maintainer” assigned.

  • The maintainer first uses the Github interface to view the changes and give the code a last check.
  • If approved, the maintainer must merge the branch targeted by the pull request into the qa branch.  If the commit will have no conflicts, Github’s interface is sufficient to merge in the change.  Otherwise, the maintainer can either reject the change, asking the original developer to rebase the branch again and resolve the conflict before creating a new pull request, or check out the branch locally and resolve the merge personally.
  • The maintainer commits the merged change and updates the JIRA story to “Submitted to QA”.
  • If QA finds a bug, they will change the JIRA status to “Failed QA”.  The maintainer will checkout the QA branch and use “git revert” to roll back the change, then will reassign the JIRA ticket back to the original developer.
  • If QA approves the change however, they will change the JIRA status to “Passed QA”.



Release Day


At regular intervals, a development team will release a set of features that are ready and desired.  A release consists of:
  • A developer merging QA-approved changes from the QA branch to the staging branch.
  • Members of the team having a last look at the change’s functionality in the staging environment.
  • The developer of a change, after confirming that it works correctly in staging, merges the change into the prod branch before a designated release cutoff time.
  • The developer changes the status of the JIRA story to “Resolved”.
  • The system administrators will deploy a build including the last commit before the cutoff time.  For New York Magazine, this entails a brief period of down-time, so the release is coordinated with the editors and others who potentially will be affected.


What’s Not Set in Stone


That’s a summary of how I work, and although everything is “sensible”, it’s a bit in flux.  These are things which could be changed:

  • We can get rid of the staging environment, and merge directly from QA.  I see the value in this extra level of testing, but I believe four stages is a bit too cumbersome.
  • A project does not necessarily need a maintainer, and if we use Crucible, perhaps not even pull requests.  A developer can merge their change directly into the QA branch and submit the story to QA on their own.  I prefer to have a project maintainer.
  • We can get rid of Crucible, and just use the code review system in Github.  It might not be as feature-filled, but if we use pull requests, it’s readily available and could streamline the process.  I like Crucible, although it might be worth exploring eliminating this redundancy.



Conclusion


After years of using many other version control systems, Git has proven to be the one that makes the most sense.  It’s certainly not dependent on a reliable internet connection.  It’s fast.  It’s very flexible.  After over 20 years of professional software development, I conclude Git is an absolutely indispensable tool.

BuddyChat, a Simple Example of Akka Actors With an Akka FSM

I wrote a silly chat program in Scala to demonstrate functionality provided by something called “Akka”, which is available for both Scala and Java.  It lets you easily write your program so that:

  • Everything is asynchronous, which means that things can be run at the same time when possible.  If two people want to say hello, let them say hi at the same time.  Those “hello” messages will be displayed when they arrive, in whatever order.  Being able to run things in parallel is massively important when a computer has more than one processor, or when some things might have to wait to complete, such as, perhaps, searching for a Wikipedia article.
  • Asynchronous programming is safe.  With “lower-level” implementations, it’s very easy to screw up, and your software, although perhaps faster, will be prone to crashing or generating erroneous results.

Without further ado, let me introduce you to BuddyChat!  It’s ugly and it’s silly, but it’s educational.  For people who want to see it on github.com, it’s publicly available here:

https://github.com/sdanzig/buddychat

And here’s a sample test run:



Description


The “gist” of this is that you’re participating in a chatroom.  You run BuddyChat.  BuddyChat creates the manager of the chat.  This manager will create all the participants, both automated and human.  The one human participant it creates represents you and will provide you an interface to make it speak in the chat room.  Whenever a participant speaks, the message goes to the chat manager who forwards the message on to the other participants.

There are a couple other little features I’ll describe later, but that’s the bulk of it.  Here’s a diagram showing this:
Aside from the slightly different names, and a couple of the messages, it’s exactly as described.  The arrows represent both the actual construction of the objects, and also sending messages between them.


Construction


  1. The BuddyChat object is automatically created when the application is run.
  2. The BuddyChat object builds ChatManager.
  3. ChatManager builds the three BuddyActors (the automated chat participants)
  4. ChatManager builds UserActor.
  5. UserActor builds ConsoleActor, which accepts input from you.

Messaging


  1. BuddyChat starts off ChatManager with a CreateChat message.
  2. ChatManager receives CreateChat, then constructs the participants.
  3. ChatManager starts off all participants with a Begin message, which all but UserActor ignore.
  4. UserActor starts off ConsoleActor with an EnableConsole message.
  5. ConsoleActor sends each line of text you type as a MessageFromConsole message to UserActor.
  6. UserActor will send this text in a Speak message to the ChatManager.
  7. ChatManager will record the Speak message to its history, then forward it onto the BuddyActors.
  8. In response, each BuddyActor generates and sends a new Speak message to ChatManager.
  9. ChatManager will record each Speak message to its history, then forward them to the other participants. The BuddyActors will see the new messages are not from a human and will ignore them. The UserActor prints out the message to the screen.

There are also other messages UserActor can send ChatManager:

  • KillChat - Shut down the chat application.  Generated when UserActor receives “done”.
  • StopChat - ChatManager will clear its chat history and stop accepting Speak messages.  Generated when UserActor receives “stop”.
  • StartChat - ChatManager will resume accepting Speak messages. Generated when UserActor receives “start”.

ChatManager’s Finite State Machine (FSM)


I love finite state machines.  Let me explain what they are:

Something can be in just one out of a set of states.  When in a particular state, it behaves a particular way. When a particular condition is met, it can transition to a different state.

That’s it.  They make it very easy to model potentially complex software.  Just think of what your possible states are, and what it takes to get from one state to another.  I implemented ChatManager as a finite state machine.  The states it can be in are:

  • ChatOffline
  • ChatOnline

By default, ChatManager is in the ChatOffline state.  Upon receiving the CreateChat message, it transitions to the ChatOnline state.  Receiving StopChat and StartChat messages will cause ChatManager to transition to ChatOffline and ChatOnline, respectively, if not already in the target state.
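The two-state machine just described can be sketched in plain Scala, without Akka.  This is only an illustrative model (the real ChatManager uses Akka’s FSM trait), but the states and messages mirror the ones above:

```scala
// A plain-Scala model of ChatManager's two states; illustrative only,
// not the actual Akka implementation.
sealed trait State
case object ChatOffline extends State
case object ChatOnline extends State

sealed trait Message
case object CreateChat extends Message
case object StartChat extends Message
case object StopChat extends Message

// Given a current state and a message, produce the next state.
def transition(state: State, msg: Message): State = (state, msg) match {
  case (ChatOffline, CreateChat) => ChatOnline
  case (ChatOffline, StartChat)  => ChatOnline
  case (ChatOnline, StopChat)    => ChatOffline
  case (s, _)                    => s // any other combination: stay put
}
```

Note that a message that doesn’t apply in the current state simply leaves the state unchanged, matching the “if not already in the target state” behavior described above.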

Given this, there’s a negligible hiccup that can occur in response to a CreateChat message: ChatManager will create a UserActor, and could send it a Begin message just before transitioning to ChatOnline.  What this means is, for a very short but real window of time, the UserActor could send a Speak message while the state is still ChatOffline, which would consequently get ignored.  Akka provides you a way to specify something to occur during a particular transition.  In this case, ChatManager sends out the Begin message on the transition from ChatOffline to ChatOnline.


Finite State Machine Data


Okay, I lied, there’s one more complexity, at least in Akka’s version of an FSM, which I used.  Akka works very cleanly if you adhere to the design and don’t use anything that’s “shared”.  By this, I mean you’re not supposed to let things write information/data to the same place at the same time, or even read the same data if it could change at any point.  The way Akka actors (which all of the actors mentioned before, including ChatManager, are) can safely communicate is through messages.  Just as the Begin message wasn’t sent out until in the ChatOnline state, it’s possible to ensure that a piece of data changes at the same time the state changes.  ChatManager uses this data-handling to manage its list of chat participants and the chat history.


The Code


The source code for BuddyChat is available at:

https://github.com/sdanzig/buddychat

To start, we’ll look at the first thing that does something…

The BuddyChat Object


The first line shows how, in Akka, an actor is created.  “manager” is a unique name you can use to refer to the actor later.  It’s not meant to look pretty, and adheres to a number of restrictions, such as having no spaces, but I use it for display purposes in this demo so I don’t have to bother with storing a more visually appealing name.  The second line is sending a basic message to ChatManager, to tell it to get things started.  It’s quite possible to send just Strings as messages, such as:

manager ! "create chat"

However, by having a specific message class “CreateChat”, the compiler can warn you about typos.
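To sketch why a dedicated message type beats a raw String (a hypothetical handler, not the real ChatManager code): a misspelled case object is a compile error, while a misspelled String silently falls through to a catch-all.

```scala
// Hypothetical sketch: a dedicated message type versus a raw String.
case object CreateChat

def handle(msg: Any): String = msg match {
  case CreateChat => "creating chat"                 // the compiler knows this name
  case other      => s"unrecognized: $other"         // typos in Strings end up here
}
```

Writing `handle(CraeteChat)` would fail to compile, whereas `handle("craete chat")` compiles and only misbehaves at runtime.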

ChatManager


ChatManager starts off as follows:

ChatManager inherits the functionality of Akka’s Actor, and it’s given the FSM trait, which allows it to operate as a finite state machine.  The number of automated participants is controlled by this hard-coded constant.  ChatManager is initialized as being in the ChatOffline state, and with no users and no chat history.  Not even empty lists, which is why it’s simply Uninitialized.

Akka’s structure for handling messages when in a state is quite intuitive.  It follows the paradigm: “When in state A, handle messages of type 1 this way and messages of type 2 that way.”  See ChatManager’s logic in the ChatOffline state:

As you can see, when offline, ChatManager can handle a CreateChat message and a StartChat message.  I won’t dive too much into how case classes work in Scala, but I will point out that you don’t just see “case CreateChat” here.  You see “case Event(some message type, some state data)”.  This is being used not only to respond to a particular incoming message, but also to read in the state data.  It’s possible to also have it respond to a message type differently depending on what your data is.  In this case, we know we only want to respond to CreateChat messages when the data is Uninitialized, so we specify this.  This ensures that if we erroneously get a CreateChat message after the chat has been created, the message will be ignored, because although the message type matches CreateChat, the state data does not match Uninitialized.

Upon reception of CreateChat, ChatManager instantiates the sole UserActor, named “user”, and the three BuddyActors.  The combination of the two,

user :: list

becomes the new state data upon transitioning (going) to the ChatOnline state.  When there is no state data, CreateChat is one message that can provoke this transition.  The other is StartChat, but only if the chat participants are already created.  That stipulation is reflected by ChatData(chatters, _).  The underscore is a placeholder for the chat history, used to convey indifference to what, if any, chat history exists.  Checking the list of chatters alone is sufficient to ensure StartChat is processed only when it should be.  Upon processing a StartChat message, ChatManager will transition to the ChatOnline state, retaining the list of chatters and creating a new, empty chat history (List[String]()).
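The matching on both message and state data can be sketched in plain Scala.  This is an analogue of Akka’s Event matching, not Akka’s actual API; the Event and ChatData names follow the article, and the returned strings merely describe the action taken:

```scala
// Plain-Scala analogue of matching on (message, state data), as described.
sealed trait Data
case object Uninitialized extends Data
case class ChatData(chatters: List[String], history: List[String]) extends Data

case class Event(msg: String, data: Data)

def whenOffline(e: Event): String = e match {
  case Event("CreateChat", Uninitialized) =>
    "goto ChatOnline using new chatters"           // only when no data exists yet
  case Event("StartChat", ChatData(chatters, _)) =>
    "goto ChatOnline with fresh, empty history"    // only when chatters exist
  case _ => "ignored"
}
```

A stray CreateChat after the chat already exists falls through to the catch-all, exactly as described above.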

As mentioned before, ChatManager has some logic for immediately after transitioning from offline to online, to avoid the window of time when a UserActor can send a Speak message when ChatManager is still offline (and thus being ignored):

While the automated BuddyActors ultimately ignore the Begin message, because they only send messages in response to the user anyway, the UserActor, upon receiving a Begin message, will instruct the ConsoleActor to start receiving keyboard input.  One more quirk here.  This part:

(Uninitialized, ChatData(chatters, _)) <- Some(stateData,nextStateData)

What that is doing is ensuring the Begin message is only sent out when the chat participants are first created.  The state data goes from completely uninitialized to existing state data complete with a list of chatters.  If the change in state data doesn’t match that, then nothing happens during the transition.

While in the ChatOnline state, ChatManager uses this message handling logic:


In this state, ChatManager can now accept Speak messages.  Upon receiving a Speak message, ChatManager will forward the message to all chat participants except (different from) the sender, which is where the “diff” is applied.  “forward” is used to re-send the messages rather than the typical ! because forward will send the message as if it were from the same “sender”.  Akka allows you to, upon receiving a message, access the sender of that message, and if ChatManager used !, it would appear that ChatManager originated the message.  This allows the message receiver to handle a message in a different way based on who sent it.

When writing BuddyChat, I initially allowed BuddyActor to respond to all incoming messages, but ultimately the problem arose where all the BuddyActors responded to other BuddyActors repeatedly and endlessly.  By only responding to messages where the sender has the name “user”, the BuddyActor is assured to avoid this issue.

Note Speak does not cause a transition.  ChatManager will “stay” at its current state.  However, it uses updated state data (ChatData) which has a chat history that includes the new message.

ChatManager can also receive a StopChat message while in the ChatOnline state.  This will cause ChatManager to go to the ChatOffline state, and while the list of chatters is preserved in the new ChatData, the chat history is replaced by an empty list of messages.

When there is no case that matches the message in the handler for the particular state, the message is dealt with in the whenUnhandled block:

In either state, ChatManager should be able to handle the KillChat message, so it makes sense to receive it here.  While whenUnhandled certainly can deal with messages that are unexpected in the current state, the fall-through logic that leads messages to whenUnhandled makes it a perfect place to handle messages that are treated the same in any state.  ChatManager does not have to clean up any resources upon shutdown, so it can call context.system.shutdown to end the application immediately.  Just for demonstration’s sake, ChatManager prints out the entire chat history first, summarizing who said what.  Note that when ChatManager stores text from Speak messages, it prepends the name of the actor that generated the message.

If a message is actually unexpected, there is a catch-all handler that will log the message with current state data as a warning, but otherwise do nothing.

UserActor


A UserActor is constructed by ChatManager when it receives a CreateChat message.  Upon creation, the UserActor will create a ConsoleActor.  Very soon after UserActor is created, ChatManager will enter ChatOnline state then pass it a Begin message.  UserActor is not a finite state machine.  It will respond to the same set of messages the same way no matter the circumstances.  The messages are handled by UserActor’s receive method:

Upon receiving a Begin message, UserActor sends an EnableConsole message to ConsoleActor it created.  If the UserActor tried to wait for user input directly (which I initially tried to do), it would not be able to receive any further messages.  Why is this?
An actor in Akka has a message queue which is processed one message at a time.  Waiting for keyboard input is a “blocking” operation, which means that execution ceases until keyboard input is received.  Because you need to repeatedly wait for the next line of input in a loop, the Begin message handler would never exit.  It would just repeatedly end up waiting for keyboard input.
The solution is to let ConsoleActor handle it.  If ConsoleActor receives one message and then endlessly waits for user input, this is okay, because it’s running in another “thread of execution”.

UserActor, after enabling the console input, will wait for an incoming MessageFromConsole.  If the text encapsulated by this message is one of the following, there is special handling:

  • “done” - Upon receiving this, UserActor will send ChatManager a KillChat message to shut down the chat system.
  • “stop” - UserActor will send ChatManager a StopChat message to disable the chatting and clear the chat history.
  • “start” - UserActor will send ChatManager a StartChat message to re-enable chatting.

If the text does not match any of those, UserActor will encapsulate the text in a Speak message and send it to ChatManager, allowing the user to communicate with the other chat participants.

From the ChatManager, UserActor can receive Speak messages which would have originated from other chat participants (BuddyActors) and then been forwarded by ChatManager.  Because the Speak message  was forwarded rather than resent, the sender is the actor that generated the message, not the ChatManager that directly sent it to the UserActor.  This allows the UserActor to pull out the originator’s name to identify the sender of the message for display purposes (labeledText).

There’s one more nifty thing to mention about this “matching” methodology in receive.  Later you’ll see the declaration of the messages that are passed around between actors.  If all of the messages that an actor can receive have a “sealed trait”, then whenever you are handling a message with this trait, Scala can confirm that you have handled every possible message that has this trait.  This is called “checking for completeness” in a pattern match.

ConsoleActor


ConsoleActor’s sole purpose is to accept input from the keyboard and send it to the UserActor in a MessageFromConsole message.

It receives one message, EnableConsole, and then displays instructions and enters the loop that accepts lines of input from the keyboard.  For each line, a MessageFromConsole message is sent to the UserActor, which ConsoleActor identifies as its “parent”, since UserActor created it.  The only way to exit this loop is to type “done”.  That fancy getLines.takeWhile is generating a “stream”, which is a feature in Scala.

A stream can be iterated over just like a list, and each element is generated on the fly.  The takeWhile, upon detecting a value that doesn’t meet a condition, will make the for loop think it’s just reached the end of the list, instead of processing the nonconforming value from the stream.
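The takeWhile mechanics can be sketched with a canned Iterator standing in for the console’s getLines (the input lines here are hypothetical):

```scala
// Canned "console input"; in ConsoleActor this comes from getLines.
val lines = Iterator("hello", "how are you?", "done", "never seen")

// takeWhile stops at the first value failing the condition, as if
// the sequence had ended there; "done" and later lines are dropped.
val received = lines.takeWhile(_ != "done").toList
```

Here `received` contains only the lines typed before “done”, which is exactly the loop-exit behavior described above.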

After the loop has terminated, “done” is sent to UserActor to shut down the chat.

BuddyActor


BuddyActor’s sole purpose is to respond to messages from UserActor.

Just to make BuddyActor interchangeable with a fully functional UserActor, BuddyActor handles all the messages that a ChatManager might send to any other chat participant, such as UserActor:  Speak and Begin, although it will ignore the Begin message.  In the Speak message handler, the message is also ignored if the sender’s name is not “user”.  This prevents a BuddyActor from endlessly conversing with another BuddyActor.  When responding to a Speak message, BuddyActor will randomly generate one of three silly responses, including the text from the received message in the reply.  This inclusion is mainly to prove that BuddyActor is successfully receiving the forwarded message from the UserActor.

The random number generator, “rand”, uses a “seed” based on the current time in milliseconds combined with the “path” of the actor, which must be unique amongst actors.  Without varying the seed per actor, each BuddyActor’s random number generator would generate the same sequence of numbers.

Messages


The messages passed between actors are defined as follows:

The messages are given traits such that all possible messages an actor can receive share a common trait. If you accidentally remove the handling for a message in that set, the Scala compiler will warn you.  The “sealed” keyword means that all the possible classes that use that sealed trait are in the same file.  This allows the programmer to guarantee that all messages which use the trait are accounted for.

There are two subtleties used here while defining these traits:


  • A message class can have more than one trait.  The Speak message is a ChatParticipantSystemMessage and a ChatManagementSystemMessage.  That means, respectively, that it’s one of the messages that a chatter can receive, and also one of the messages that ChatManager handles.
  • A trait can extend another trait.  By saying that ChatParticipantSystemMessage extends UserSystemMessage, you’re saying that the set of messages with the UserSystemMessage trait is equal to or greater than the set of messages with the ChatParticipantSystemMessage trait.  Any message with the ChatParticipantSystemMessage trait also has the UserSystemMessage trait, so the set of user system messages is at least that set, and perhaps more.  In this case, there’s one additional message that a UserActor can receive that the other chatters (the BuddyActors) can’t: MessageFromConsole, which comes from ConsoleActor.  ConsoleActor only communicates with UserActor, so this makes sense.
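A sketch of the trait hierarchy as described (the exact definitions live in the buddychat repository; the names follow the article, but the bodies here are simplified guesses):

```scala
// Sealed traits grouping the messages each actor type can receive.
sealed trait UserSystemMessage
sealed trait ChatParticipantSystemMessage extends UserSystemMessage
sealed trait ChatManagementSystemMessage

// Speak carries both traits: chatters receive it, and ChatManager handles it.
case class Speak(text: String)
  extends ChatParticipantSystemMessage with ChatManagementSystemMessage

// Only UserActor receives console input, so this has only the user trait.
case class MessageFromConsole(text: String) extends UserSystemMessage
```

Because ChatParticipantSystemMessage extends UserSystemMessage, every Speak is automatically a UserSystemMessage as well, while MessageFromConsole stays outside the set of messages a BuddyActor must handle.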


Conclusion


The BuddyChat system is certainly only meant to serve educational purposes.  However, it demonstrates many very useful technologies in both Akka and Scala.  Programmers no longer need to fear multi-threaded programming as long as they properly use an actor system such as the one provided by Akka.  Akka’s FSM can simplify the implementation of a complex system by grouping its behaviors by its possible states.  The overhead of the actor framework makes it unsuitable for applications requiring maximum performance (such as handling billions of tweets or time-sensitive stock ticker updates), where lower-level handling of parallel execution is recommended; otherwise, Akka actors are amazingly easy to deal with and should be used.


Notes


I’m unsure if there’s a way to do some form of completeness checking in the FSM handlers.  I’d guess not, but please let me know if there’s a way.
I tried implementing the check for the name “user” as a guard in the pattern match:

case Speak(msg) if "user".equals(sender.path.name)

I had a case Speak(msg) after that to catch the other Speak messages and ignore them.  However, this disabled the completeness checking.  I saw in older versions of Scala that guards were handled improperly and this had been fixed, but perhaps the change was reverted, or, more likely, I’m doing something wrong.
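Outside of Akka, the guard pattern in question looks like this (a hypothetical helper; the senderName parameter stands in for sender.path.name):

```scala
// The guard pattern from the note above, sketched outside Akka:
// respond only when the sender is named "user", ignore other Speaks.
case class Speak(msg: String)

def react(senderName: String, speak: Speak): Option[String] = speak match {
  case Speak(m) if senderName == "user" => Some(s"responding to: $m")
  case Speak(_)                         => None // another buddy; ignore
}
```

The second `case Speak(_)` is the silent-ignore branch mentioned above; with a guard on the first case, the compiler can no longer prove the match exhaustive on its own.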

The Optimization of Java’s HashMap Class

Yesterday I was on Quora.com sifting through some Q&A and I ran across someone describing an optimization that was made in Java’s HashMap class, according to the poster around version 1.4.  It was simple, yet it amazed me.  I didn’t understand how it could work at first, but with a little digging, I figured it out, and it’s simple yet very clever.  First I’ll briefly explain some of how a hash map works, for the laymen (I think everyone can understand most of this), then I’ll go on to describe the change.


Description of What a Hash Map is

A hash map is a way a computer can store a set of things in memory for quick access.  Picture that I had a function that took a word, let’s say “bird”, and converted it to a number, such as 7.  As long as you had the same input, you’d always get the same output.  So when you wanted to access a bunch of information with the label “bird”, you could find it in storage bin number 7.  You only have to look in one bucket, so it’s super-fast.

Ideally, your function would produce a unique number for every unique word.  But the function might not be perfect, and if “bird” and “potato” both produce a 7, then when you want to look up either, you might have to check two spots in memory instead of one, which takes longer.  This is called a “collision”, and you want a function that avoids collisions as much as possible.

Now, it’s true that if you had a billion words, it’s unrealistic that your computer could have a billion separate spots in memory to hold it.  But your function produces unique numbers for nearly all of them, so you ultimately want the hash map to have a place for each number.  What a HashMap will do is take the number of spots in memory it DOES have (let’s say 16), and divide the number output of the function by it, and use the remainder instead.  This is the “modulo operation”, represented by the percent (%) sign.  That way, you’re never trying to put something in a memory location that your hash map can’t support.  So if your function said “banana” should go in spot 39, then you’d see 39 % 16 = 7.
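The arithmetic above, as a quick sketch (the word-to-number values are hypothetical):

```scala
// Map a hash code to one of a fixed number of buckets via modulo.
def bucketFor(hashCode: Int, buckets: Int): Int = hashCode % buckets

// "banana" hashed to 39; with 16 buckets it lands in bucket 39 % 16 = 7.
val banana = bucketFor(39, 16)
```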

Certainly you’re going to have plenty of collisions, but there are a couple of key optimizations that can be made.  First, you want the function to spit out numbers that are as evenly distributed as possible, so you don’t have a bad scenario where you’re searching through most of the words because they all resulted in, for instance, the number 7.  There are formulas provided by others who have thought this through already, so just use those.  Secondly, when the hash map gets too full, it will increase the number of spaces available, and move all the old words to their new locations based on the new number of “spots”.

Just so you can talk the talk, the spots in memory a hash map has available are called “buckets”.  The function that converts words to numbers is called a “hash function”.  The numbers are called “hash codes”.  The words are called “keys”, and the “bunch of information” attached to a key is called a “value”.


Java’s Hash Map Optimization


The above modification shows the change, but it’s dependent on a couple of other behaviors of the Java hash map.  First, I’ll review what’s going on.  I mentioned how modulo is used to determine what bucket a specific hash code maps to.  This is replacing that modulo with a “bitwise AND”.  I’m not going to review too much about binary here, but it’s all 1’s and 0’s instead of 0-9 like the base-10 (decimal) numbers you’re used to.  So if you have 1 & 1, you get 1.  But if either or both is a 0, you get 0.  Picture converting the hash code and the number of buckets to a bunch of 1 and 0 “bits”, then doing this AND operation on each bit, from right (least significant/smallest) to left.

If you think about it, you might wonder how this works, because a bitwise AND isn’t the same thing as modulo.  The AND is done against one less than the number of buckets.  If you have 5 buckets, you’re converting 4 to binary – 100 (google “4 in binary”).  That means whatever your hash code is, only the third bit will matter, because the other bits will be ANDed to 0.  indexFor will always output either a 0 or a 4.  That will be a crazy amount of collisions.

First Trick


There will never be 5 buckets.  Java’s hash map implementation, when expanding, multiplies the number of buckets by 2, so you’ll always have a power of 2 (1, 2, 4, 8, 16, etc.).  When you convert a power of 2 to binary, only one bit is a 1.  When you subtract 1 from a power of 2 and convert that to binary, that bit becomes a 0, and all the bits to the right of it become 1’s.  AND any number against a run of 1’s and you get that number modulo (the value of those 1’s plus 1).  ANDing bits like this is much faster than doing a modulo, which requires division.
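A minimal check of the trick, with hypothetical helper names (the real method in Java’s HashMap is indexFor):

```scala
// For power-of-two table sizes and non-negative hash codes,
// masking with (n - 1) equals taking the modulo.
def indexForModulo(h: Int, n: Int): Int = h % n
def indexForMask(h: Int, n: Int): Int   = h & (n - 1)

// e.g. 39 & 15 == 39 % 16 == 7
val same = indexForMask(39, 16) == indexForModulo(39, 16)
```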

Second Trick


There’s also a concern which might not be obvious, but if you are relying on only the smallest bits of your hash code, you can easily get an uneven distribution of keys in your buckets unless you have a really good hash function.  What Java’s hash map implementation does is to “rehash” the hash code.  Check this out:


This scary thing takes your mostly unique hash code and randomizes it in a way that has a relatively even distribution in the “lower bits”.  For curiosity’s sake, I’ll mention that the >>> is shifting the bits in your hash code to the right… so if you had a 4, or 100 in binary, and you did 4 >>> 2, you’d end up with 001, because it’s been right-shifted twice.  The ^ is an “exclusive OR” operation, which is similar to the AND operation, but it outputs a 1 if the two bits are different (one’s a 1, and the other is 0).  Essentially this thing is ensuring that the more significant bits in your hash code are affecting the least significant bits that you’re ultimately going to use to choose each bucket.
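For reference, JDK 6’s supplemental hash function looked roughly like this; here it is transcribed into Scala from my recollection of the OpenJDK source, so treat it as illustrative rather than authoritative:

```scala
// Rehash a hash code so that higher-order bits influence the low bits
// that are used to pick a bucket (after JDK 6's HashMap.hash).
def rehash(hash: Int): Int = {
  var h = hash
  h ^= (h >>> 20) ^ (h >>> 12) // fold higher bits downward
  h ^ (h >>> 7) ^ (h >>> 4)    // spread them into the lowest bits
}
```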


Hope you found this all as righteous as I did!

The Option Design Pattern

When starting to learn Scala, front and center was the utility of the Option design pattern.  I think it’s useful, but at first it’s fairly unwieldy, and it’s much more useful when you know how to reasonably work with it.
The problem the Option pattern attempts to solve is the frequency of the NPE (null pointer exception), certainly a constant thorn in the side of every Java programmer. The problem is that you’re mixing the valid range for a value with something that is invalid: “null”. By allowing this combination, anyone who uses your provided value must be aware of, and accommodate, the possibility that this “I am invalid!” placeholder can occur.  There are no safeguards built into a language such as Java, so if you forget to handle this, your software can exhibit an error at runtime.
The Option pattern removes the placeholder from the range of possible values by wrapping it in an “Option” object. This object can be one of two derivative classes: Some, or None.  If it’s a “Some” object, it has a value that’s guaranteed to be in the valid range.  If it’s a None object, it represents the “absence of a valid value”.
Scala cleanly handles Option objects via pattern matching.  For example, talkAboutValue is a simple method that takes an Option object, and displays a value if it’s something, and doesn’t attempt to display a value if it’s nothing:
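A minimal sketch of the two methods being discussed (the message strings are my own, and these return the message rather than printing it, but the pattern matching is the point):

```scala
// Takes an Option: the caller is forced to confront the "no value" case.
def talkAboutValue(opt: Option[String]): String = opt match {
  case Some(s) => s"The value is $s, with length ${s.length}"
  case None    => "There is no value"
}

// Takes a plain String: the input is expected to be valid,
// so calling length here is considered safe.
def talkAboutString(str: String): String =
  s"The value is $str, with length ${str.length}"
```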

Scala displays a compile-time warning if None is not handled in talkAboutValue’s pattern matching, encouraging comprehensive handling of the input. The input for the talkAboutString method is not an Option, so it’s expected that the input will be a valid String.  This allows the programmer to confidently call the length method, without worrying about an NPE.  You can still input a null to talkAboutString, and handle the null value the Java way (e.g. if (str == null) { … }), but Scala discourages this.
One unavoidable place where Scala makes use of the Option object is in its implementation of Maps.  When you query a value, it’s either in the map or it’s not.  If it’s not, rather than returning a null value as you may have been accustomed to in a language like Java, Scala returns None.
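For instance (the map contents here are just made up for illustration):

```scala
// Scala's immutable Map returns an Option from get:
val capitals = Map("France" -> "Paris", "Japan" -> "Tokyo")

val hit  = capitals.get("France")  // Some("Paris")
val miss = capitals.get("Narnia")  // None -- no null, no exception
```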

Unfortunately, even for a lookup that finds a valid value, you still have the Option object “wrapper” to deal with.  That means you have to do not only the “get” for the lookup, but another get for the actual value. An if (val != null) check seems much easier than doing a pattern match every time.
Fortunately, Scala alleviates this via facilities in its core API. The most obvious is “getOrElse”.
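A quick illustration (the keys and defaults are hypothetical):

```scala
val settings = Map("timeout" -> "30")

// getOrElse unwraps the Some, or supplies a default when the key is absent
val timeout = settings.getOrElse("timeout", "60")  // "30"
val retries = settings.getOrElse("retries", "3")   // "3", the default
```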

This doesn’t alter the handling of the value if a lookup is unsuccessful, but it does at least provide a default value, which may be appropriate in many situations such as a default configuration setting. Another very common use case is handling a list of Options resulting from iterating over a collection:
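As a sketch of that situation (names invented for the example):

```scala
val lastNames = Map("alice" -> "Smith", "bob" -> "Jones")
val queries   = List("alice", "carol", "bob")

// Mapping a lookup over a collection yields a List of Options...
val results: List[Option[String]] = queries.map(lastNames.get)

// ...and each element must be unwrapped (here via pattern match) before use
results.foreach {
  case Some(last) => println(s"Found $last")
  case None       => println("No match")
}
```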
It’s not so bad in this single-use instance, but if you plan on using a collection of Options repeatedly, you may wish to preprocess it, to remove the wrapper objects:
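For example, with some made-up values:

```scala
val wrapped = List(Some("Smith"), None, Some("Jones"))

// flatten strips the Some wrappers and discards the Nones entirely
val unwrapped = wrapped.flatten  // List("Smith", "Jones")
```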

Flatten results in a list of just the unwrapped values in the Some objects.

Flatten is the same method that can concatenate the lists in a list to a single list. E.g. List(List(1,2),List(3,4)).flatten results in List(1,2,3,4). If you care, the way flatten is able to operate on a list of Options is because of an implicit method, option2Iterable. An option can be converted to a list of 0 or 1 elements (0 for None, 1 for Some) with its toList method. This implicit method is called by flatten, resulting in the same treatment as a list of lists:
implicit def option2Iterable[A](xo: Option[A]): Iterable[A] = xo.toList
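You can see both behaviors with a few throwaway values:

```scala
// toList turns an Option into a zero- or one-element list, which is what
// lets flatten treat a List of Options just like a List of Lists
val one   = Some(5).toList                       // List(5)
val zero  = None.toList                          // List()
val mixed = List(Some(1), None, Some(3)).flatten // List(1, 3)
```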
Scala, despite its infamously steep learning curve, is beautiful in the way its implicits provide such ease of use for common programming tasks. Along those lines, it’s also common to have a list of values that must be passed to a method that returns an Option. If you only want to handle the iterations with successful results, such as values successfully retrieved from a map, there is another accommodation called “flatMap”, which replaces a map followed by a flatten.
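A sketch of the lookup scenario (the names and the map are invented for the example):

```scala
val lastNames = Map("alice" -> "Smith", "bob" -> "Jones")
val people    = List("alice", "carol", "bob")

// One step instead of map-then-flatten: missing names simply drop out,
// and the greeting is built only for last names actually found
val greetings = people.flatMap(name => lastNames.get(name).map(last => s"Hello, $last"))
// greetings == List("Hello, Smith", "Hello, Jones")
```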

The function passed to flatMap produces a result only for the last names that are successfully found in the map; the missing ones simply drop out.

Two more nifty things to know.  If you pass a value to Option’s factory method, it automatically wraps non-null values with Some() and replaces null values with None. For instance, try going to your Scala REPL and typing:
Option(null)

Now try:
Option(3)
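Both cases together, with the results you should see:

```scala
// Option.apply wraps non-null values in Some and turns null into None
val fromNull  = Option(null) // None
val fromValue = Option(3)    // Some(3)
```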

And lastly, just to demonstrate how easy working with Options can be, please check out Tony Morris’s nifty cheat sheet.  If you notice yourself handling an Option in a certain way, there’s a very good chance that Scala, or at least the scalaz library, provides a shortcut you can use instead:

Conclusion
In summary, I agree the Option pattern seems to get in the way when you’re starting out with Scala, but in the end, it results in much safer code, free from one of the most ruthlessly frequent runtime errors of our time. Scala makes it not only available, but also easy to work with, so learn to love it!

Self Introduction

Hello internet people,
I’ve been a computer programmer since my parents bought me an Apple IIc when I was 7.  I’m 36 now, and my path in life, since then, went like this:
  • ProDOS BASIC, for choose your own adventure games on the Apple IIc
  • LabVIEW for Windows, at an AT&T Bell Labs internship
  • MudOS LPC, when I ran my own MUD
  • Java, for developing a Legend of Zelda-like game for Software Engineering class
  • Perl, for an AT&T e-billing system
  • Perl, PL/SQL and Java, for the backend functionality at Register.com
  • J2EE, for enterprise-networking software at Avaya
  • C, Java, Visual C++, and C#, for simulation software at CSC
  • Visual C++ and Java, for operational awareness software at Viecore FSD and then Future Skies
  • Java and C, for flight simulation software at MIT Lincoln Laboratory
  • JavaScript/CSS for front end and Java with Spring for backend, transferring and accessing broadcast video at Reuters
  • Python, Flex/ActionScript, and Java, for portfolio accounting software at HedgeServ
  • Scala, Java, Python, and Objective-C, for web and mobile publishing at New York Magazine
At this point in my career, I have three main interests that serve as the motivation for this blog:
  • I am invigorated by the recent push in the world of software development toward functional, non-blocking programming.  In particular, I’ve absolutely loved learning about Scala recently, and plan to work towards greater expertise leveraging its capabilities.
  • I have become more attuned to the need for adherence to established design patterns, and want to familiarize myself with them more deeply, so that I can better exploit their capabilities when the opportunity arises.
  • It is difficult to stay abreast of all immediately-relevant open source technologies.  For nearly any challenge that exists in programming, I’ve found that someone has developed some sort of solution that will at least partially alleviate its inherent difficulties.
So, in general, that’s what I’ll be posting about.  Nice to make your acquaintance, don’t be a stranger, and feel free to point out anything I wrote that you think is good, bad, right, wrong, misguided or silly.
Thanks for reading!