Fear of Starting

I’m not going to take my hands off my computer until I’m done with a draft of this blog post.

I want to learn how to be unafraid of  starting a job or project or task, moving towards a goal, initiating a transaction, and so on. Easier said than done. So I’m going to do this now. Ride with me. You might just be as afraid of starting as I am. So flow with me. I’m going to envelope all my fears (for now) somewhere in the foggy landscape of Utah (where I currently reside). And I hope that I don’t find it again, after the atmosphere is less fuzzy, in my head and in the troposphere.

As I write this post, I wander if I’m ever going to “finish” it. By “finish”, I mean complete writing a blog post with perfect flow, heavy but sensible content, and impeccable grammar. If I don’t even start, I’ve clearly failed. If I do complete drafting the blog post, then maybe I’ve succeeded at finding a coarse stepping stone to a better blog post in the future. But am I ever going to “finish” this blog post? Logically, I can’t because no one’s perfect, right. Or so they say. Therefore, the only reasonable line of action is to complete the draft and convince myself that my draft is “good enough.” What’s holding me back is perfectionism. My quest for perfection in writing a blog post is not invisible in other parts of my life. Perhaps, if and when I wrap up this blog post, I could be inspired to wrap up the myriad unfinished projects I have, the lingering tasks my head is convinced is still moldable to perfection. But then, Vince Lombardi once said, “Perfection is not attainable, but if we chase perfection, we catch excellence.” Why should I be afraid of the unattainable? Fear of mediocrity. Flush.

I use a task management software called WunderList that currently has more than 100 unaccomplished non-academic tasks created by me. These tasks aren’t heavy-weight. I know that I can finalize each in less than 3 hours if I put my mind to it. Almost all were created just before the start of Fall semester. During the school year, I inexplicably convinced myself that if I try to accomplish any of these non-academic tasks, I will not have enough time to read for the upcoming test in 1 week, to prepare my worksheets for the fast approaching prefect sessions, or to attend the birthday party I was invited to. Bullshit. Now that I think about it, this fear is completely unreasonable. Because I didn’t even try to start any of these tasks. Fear of the unknown. Flush.

Both fears, the fear of the unknown and the fear of mediocrity actually aren’t trivial. H.P. Lovecraft, the inflential author of several popular horror fictions, said that the “oldest emotion of mankind is fear, and the oldest and strongest kind of fear is fear of the unknown.” As a result, fighting these fears, especially the fear of the unknown, is hard. But I’m going to START fighting them by STARTING several tasks.

I’m done! I mean, I’m done writing a draft of this blog post. Is this draft perfect? Clearly not. But it’s a good start. I’ve moved one step closer towards perfection. I’ve buzzing towards infinity (and beyond). I’m more spirited. Most of all, I’ve started!

Cardinality Estimation in Linear Time using Sub-Linear Space

sizematters

The cardinality of a collection A (which might be an ordered or unordered list, a set, or what not) is basically the number of unique values in A. For example, the collections [1,2,3,4] and [1,2,1,3,1,4,3] have the same cardinality of 4 (and also correspond to the same set).

Determing the Cardinality of a Collection: The Naive Approach

Consider a collection A=[1,2,1,3,1,4,3]. How can we systematically determine the cardinality of A? Well, here are two of many ways to do this:

  1. First sort A in ascending order. Then, we can perform a linear scan on A to remove duplicates. It’s pretty easy to see how this can be done. Finally, return the size of the possibly trickled-down collection (now a set) obtained. If the initial size of A is n. Then, the cardinality of A, using this method, can be determined in O(n\, log\, n) (if we use merge-sort) time and O(1) extra space.
  2. Use a hash table: Perform a linear scan of A, hashing the values of A. It’s easy to see that cardinality of A is the number of keys in the hash table obtained. This uses O(n) time but O(n) extra space also.

Notice that we can’t do any better (lower upper-bound) than O(n) because we have to look at the entire input (which is of size n). But can we determine the cardinality of A in O(n) time using sub-linear space (using strictly smaller space than n)?

That’s where probability comes in.

Linear Probabilistic Counting

This is a probabilistic algorithm for counting the number of unique values in a collection. It produces an estimation with an arbitrary accuracy that can be pre-specified by the user using only a small amount of space that can also be pre-specified. The accuracy of linear counting depends on the load factor (think hash tables) which is the number of unique values in the collection divided by the size of the collection. The larger the load factor, the less accurate the result of the linear probabilistic counter. Correspondingly, the smaller the load factor, the more accurate the result. Nevertheless, load factors much higher than 1 (e.g. 16) can be used while achieving high accuracy in cardinality estimation (e.g. <1% error).

Note: in simple hashing, the load factor cannot exceed 1 but in linear counting, the load factor can exceed 1.

Linear counting is a two-step process. In step 1, the algorithm allocates a bit map of a specific size in main memory. Let this size be some integer m. All entries in the bit map are initialized to “0”‘s. The algorithm then scans the collection and applies a hash function to each data value in the collection. The hash function generates a bit map address and the algorithm sets this addressed bit to “1”. In step 2, the algorithm first counts the number of empty bit map entries (equivalently, the number of entries set to “0”). It then estimates the cardinality of the collection by dividing this count by the bit map size m (thus obtaining the fraction of empty bit map entries. Call this V_n).

Plug in V_n and m into the equation r = m\, log_e\, V_n to obtain r which is the estimated cardinality of the collection. The derivation of this equation is detailed in this paper[1].

Here’s my simple implementation of the Linear Probabilistic Counting Algorithm.

Errors in Counting

Although the linear probabilistic counting method is faster than deterministic approaches, it might sometimes fail to be accurate as explained above. So this method should be used only when 100% accuracy is not needed. For example, in determining the number of unique visitors to a website.

Probabilistic Alternatives to Linear Probabilistic Counting

The HyperLogLog algorithm is such an alternative. It also runs in linear time (linear in the size of the initial collection). But the HyperLogLog counter usually gives a more accurate estimation of the cardinality count and also uses less space. See this paper for more details on the HyperLogLog counter.

References

[1] A Linear-time Probablistic Counting Algorithm for Database Applications 

Could Code Rot be Useful in Decision Making?

P7060031

Yes. Maybe. Allowing the codebase of FlyLaTex to “rot” for about 7 months enabled me make better decisions on what features to add and what features to remove. It does seem contradictory. It seems like code rot should always have an adverse effect on your codebase. It should break stuff, right. Well, it depends on how you define code rot. My definition of code rot is a little fine tuned, I must admit. Let’s step back a little.

Defining Code Rot: It’s harder than you think

You’ve probably heard of code rot in different contexts, mostly hinged on the concept of legacy PHP code. Wikipedia defines code rot as the “slow deterioration of software performance over time or its diminishing responsiveness that will eventually lead to it becoming faulty…” [1] Robert Martin wrote an article called Principles and Patterns in which he lists the 4 signs of code rot: rigidity, fragility, immobility, and viscosity. On the other hand, a lot of people are still ambivalent on or confused  about the definitions of code rot.

These definitions are fine. But none of them recognize the power of active waiting (for programmers, of course. Not this) and its influence on feature adoption in software development. Most definitions of code rot just show how bad code rot is, how we recognize code rot, and how we can avoid it. Well, maybe we can harness the ROT. I think I did.

Code Rot in FlyLatex

I started developing FlyLatex for two reasons:

  • I noticed that collaborating on LaTex documents was painful. But no free LaTex collaboration environments were available. So I set out to create one.
  • The HackNY demofest was fast approaching. I knew I couldn’t demo the product I was developing at Trendrr. FlyLatex was a good enough product for the HackNY demofest.

I started developing FlyLaTex around early July and had to finish (for the HackNY demofest) by late July. As a result, I spent less than 4 weeks developing the product. It was 2nd priority (after the product I was working on at Trendrr).

During the development process, I had to make some quick decisions on how the LaTex writing and compilation flow should work. Where should the documents be stored after compilation? Should a user be able to compile LaTex documents remotely or not? How should the compilation errors be presented to the user? I had just learned about Amazon s3. My gut feeling at that instant was to store the compiled PDFs on s3 (via mongoose-attachments) to ease remote hosting. It seemed OK at the time.

This meant that to use FlyLatex, you’d have to have an s3 account. BAD IDEA.

After a couple of minor iterations, I had a working product. FlyLatex v-0.5.0 Beta was born (Beta in this context meant–use at your own risk). My demo at HackNY was pretty successful (except for some minor hiccups, of course). People seemed to like it. I was OK with the product. But no one used the product, I noticed. I got pretty discouraged because of this.

HackNY passed. Sad moment. My internship at Trendrr passed. Fall semester began. I was studying abroad in Budapest during the Fall.

The Hiatus

From the beginning of September to the end of December 2012, I didn’t contribute a line of code to any of my github repositories. I was too engulfed in the rigor of the Budapest program (plus, the wonderful scenary of Budapest and the rest of Europe).

hiatus

This is when the code rot started! It ended a week ago (on the 17th of March).

Harnessing the Code Rot

A week ago, I pondered using FlyLatex to collaborate with a classmate on an abstract algebra project. I tried to run it. It broke, spewing several errors. I was stunned. The last time I ran FlyLatex, it had no problems.

I spent some time trying to trace the main source of the error. It came from the mongoose-attachments.js include. Apparently, mongoose-attachments.js had changed significantly. The developers of the library decided to separate the amazon s3 upload functionality from the rest of the features. That change broke my code.

whatthinking

Then the epiphany came.

I wandered why I decided to use amazon s3 to store compiled PDFs in the first place. What was I thinking?! It’s much easier for the users to store the compiled PDFs in a directory on the current server anyways. [2] I decided to store all PDFs in a subdirectory of the FlyLatex repository by default.

It took me less than 30 minutes to change the source code (and the README correspondingly) so that PDFs would be stored in the local filesystem and not in an Amazon s3 bucket.

After the change, I posted a link to the repository on HackerNews. People seem to like it.

Should you need to make some major (or even minor) decision about your software, maybe all you need is to allow your codebase rot for a little bit.

[1] http://www.cincomsmalltalk.com/userblogs/buck/blogView?entry=3320476400

[2] This is one of the many perils of developing code alone. Had more than one developer been working on FlyLatex with me, I feel we should have to the later conclusion a long time ago.