Arrow of Code

Experience Report: Swift from a Rustacean’s perspective

2018-12-31T23:59:59+05:30

I’ve been using Swift for over a year now, mostly for the decentralized e-commerce platform I’ve been working on (and a bunch of side projects like OpenSSL bindings, WebID-TLS prototype, etc.). So, this is something I’ve been wanting to write for a long time, but never found time until now.¹

A few things about this write-up. Firstly, I love the language!² And second, even though I’ll be comparing bits of both the languages, I’ll never say that one is better than the other. Both are amazing languages and they aim to solve different things. Also, I began using Rust from around its 1.0 release (about 3.5 years back), and Swift only after its 4.0 release (last year), so I’ll be talking about the features they currently have, not what they had / lacked ages ago!

As a side-note, please don’t expect this post to be a tutorial or have in-depth discussions about language internals or have some sort of order in comparisons.

Before we begin, here are a few basic differences between both the languages, just to show you why a direct comparison isn’t fair.

Rust	Swift
Low-level	Not that low-level
Embedded, kernels, browser/game engines, etc.	Web services, apps for iOS, macOS, etc.
No garbage collection	Reference-counted garbage collector
Supports pretty much all platforms	Only Xcode and Ubuntu (as of now)
Supports static builds	Foundation (Swift’s stdlib) in Linux can’t be statically linked yet

With this in place, let’s proceed.

Basic syntax

In Rust (being expression-based), semicolons have meaning, whereas Swift doesn’t need semicolons to separate statements (unless they’re on the same line), like Python.

Here’s an example demonstrating a function that simply fetches the value of the given env variable. It’s a common practice - instead of dealing with importing Foundation (stdlib) and calling ProcessInfo everywhere, we have this nice abstraction.

import Foundation

func getEnvVariable(name variable: String) -> String? {
    return ProcessInfo.processInfo.environment[variable]    // explicit return
}

let port = getEnvVariable(name: "LISTEN_PORT")!

If we were to write the same thing in Rust, then we’d do something like:

use std::env;

fn get_env_variable(name: &str) -> Option<String> {
    env::var(name).ok()
}

let port = get_env_variable("LISTEN_PORT").unwrap();

Similar, right?

Imports, optionals and argument labels

import syntax was probably the first thing I felt weird about. It’s like a “recursive glob import”. In Rust, you need to be very specific about what you want in your module, but here you import a package at the top of the file, and you get to use all (public) stuff from that package! This means, if we’d imported multiple libs, then it’s hard to know (without an IDE) where some item (type / function / whatever) is from - not to mention that we also need to grep the item in that lib to find wherever it’s located.

In order to reduce the burden, Swift devs arrange their modules in such a way that they’re self-explanatory. An example would be Vapor’s PostgreSQL driver lib. There, we have PostgreSQLConnection type in its own module, but then we also have a number of PostgreSQLConnection+Foo.swift files that contain additional implementations for that type related to some behavior “Foo” (in different modules).

Then, there are the optionals. In Rust, Option is just like any other enum, which means None is just another value. In Swift, even though Optional is an enum, it’s baked into the compiler such that ? operator (in suffix) represents an optional type, which means nil is a special value to indicate nothing. As a result of this, unwrapping an optional can be as simple as using another (exclamation !) operator.

In the getEnvVariable example above, name is the label and variable is the actual argument to be used in the function body. Argument labeling was weird at first, but nowadays, I wanna label everything! Here’s a snippet from our platform:

let product = try products.getProductVariant(for: unit.product,
                                             from: order.shippingAddress)
try checkCurrency(for: product)
let inventoryItem = try inventory.getInventoryItem(for: product.inventoryId)
try updateConsumedInventory(for: inventoryItem, with: product, in: unit)

With labeling, it’s possible to write some cool expressive code.

try … catch

In Swift, Error is a protocol (interface, if you want) which can be implemented for any type, just like Rust, where it’s a trait (another buzzword!). Only difference is that here, an error value can be thrown by some operation and can be caught elsewhere when it bubbles up.

To see this in action, let’s write a function which throws 95% of the time:

enum Luck: Error {
    case worse
    case bad
}

func iFeelLucky() throws {  // mark explicitly
    let i = Int.random(in: 0 ..< 100)   // generate Int in [0, 100)

    // switch: not just ranges, but patterns (tuples and enums too!)
    switch i {
        case 0 ..< 50:
            throw Luck.worse
        case 50 ..< 95:  // 45% of the rolls
            throw Luck.bad
        default:
            return
    }
}

do {
    try iFeelLucky()
} catch Luck.worse {
    print("Definitely not!")
} catch Luck.bad {
    print("Try again?")
} catch {
    // some other sorcery?
}

The do { } block represents your trial area, and you catch the error next to that block. It’s worth mentioning that when you throw your error value, the actual error type is erased (since it gets cast to the Error protocol), so you can throw any kind of error from the same block (if you’ve got good reason to do that). Later, when you catch, you can pattern match by casting it back to the actual error type.³
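For comparison, here is a rough Rust sketch of the same idea using Result instead of throwing. It's only an illustration: the roll parameter and the describe helper are made up for this example (Rust's stdlib has no random number generator, so the roll is passed in by the caller).

```rust
/// Mirrors the Swift `Luck` error. `Debug` and `PartialEq` are derived only
/// so that results can be printed and compared.
#[derive(Debug, PartialEq)]
enum Luck {
    Worse,
    Bad,
}

/// `roll` is assumed to be in [0, 100). It is a parameter (a choice made
/// for this sketch) because the stdlib cannot generate random numbers.
fn i_feel_lucky(roll: u32) -> Result<(), Luck> {
    match roll {
        0..=49 => Err(Luck::Worse),
        50..=94 => Err(Luck::Bad),
        _ => Ok(()),
    }
}

/// The caller matches on the concrete error type: nothing is erased, and
/// the compiler checks that every variant is handled.
fn describe(roll: u32) -> &'static str {
    match i_feel_lucky(roll) {
        Ok(()) => "Lucky!",
        Err(Luck::Worse) => "Definitely not!",
        Err(Luck::Bad) => "Try again?",
    }
}
```

Unlike the Swift version, the error type is part of the function signature here, so there's no need for a catch-all branch.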

Types

All types (and protocols) in Swift can be extended - regardless of whether they’re from a foreign package, like Foundation. For example, we could extend strings with an alphanumeric check like so:

import Foundation

extension String {
    /// Checks if the given string is alphanumeric.
    public func isAlphaNumeric() -> Bool {
        return self.range(of: "[^a-zA-Z0-9]", options: .regularExpression) == nil
    }
}

We’ve just added some functionality to a type that doesn’t belong to us! It’s a useful abstraction, yes, but Rust doesn’t allow you to do this,⁴ and I think there’s a good reason for it. When you start extending stuff you don’t own, users will have trouble finding the implementation - whether it’s from your package, or it’s from a dependency, or whether this has existed in a core package all this time!

That said, I’m not against it either (I’m doing it myself!). I’m simply unsure about the downsides (if any) to not using / having this feature.
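The closest thing in Rust is an "extension trait": you define your own trait and implement it for the foreign type. Here's a minimal sketch of the alphanumeric check done that way. The trait name AlphaNumericExt is hypothetical, and the check is an ASCII-only equivalent of the regex above rather than a port of it.

```rust
/// Hypothetical extension trait - the name is made up for this example.
trait AlphaNumericExt {
    /// Checks whether every character is an ASCII letter or digit.
    fn is_alpha_numeric(&self) -> bool;
}

// We own the trait, so we're allowed to implement it for `str`,
// even though `str` belongs to the standard library.
impl AlphaNumericExt for str {
    fn is_alpha_numeric(&self) -> bool {
        self.chars().all(|c| c.is_ascii_alphanumeric())
    }
}
```

The method is only callable where the trait is in scope, so a reader can always trace it back to an explicit import.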

Structs and Classes

In addition to enums and tuples, Swift supports structs and classes - both have static and stored properties (readable, writable or computed). Also, access to types, properties and methods can be controlled with modifiers.⁵

Let’s take a dumb struct:

import Foundation

public struct Customer {

    public let id: UUID
    public let createdAt: Date
    public var updatedAt: Date
    public var firstName = ""
    public var lastName = ""
    public var primaryEmail = ""

    public var isValid: Bool {
        return !firstName.isEmpty && !lastName.isEmpty // && validate email
    }

    public init(id: UUID, createdAt creation: Date, updatedAt updated: Date) {
        self.id = id
        createdAt = creation    // no name collision - can ignore `self`
        updatedAt = updated
    }

    public init() {
        let date = Date()
        self.init(id: UUID(), createdAt: date, updatedAt: date)
    }
}

I’m ambivalent about having methods as part of the type itself, but other than that, I like a number of things here:⁶

Foundation has a lot of stuff! So far, we’ve seen random number generation, regex, UUID and datetime. It’s nice to have all these things in stdlib, so it’s one less worry for us.
Mutation is field-specific. In the above example, id cannot be changed for an instance (even if the instance itself is mutable).
Functions could have the same names, as long as they have different signatures. init is special (in that it’s the constructor), but it’s no different from any other function.

Values and References

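All classes in Swift are “reference types” and all other types are “value types”. The difference is that instances of value types are copied when they’re assigned or passed around, whereas every reference to a class instance points to the same object. Coming from Rust, this felt like infidelity, but well, that’s what you pay for using languages with automatic memory management. Arrays and dictionaries are structs, so every time you assign them to some variable or pass them to another function, they behave as if they get copied (the standard collections are copy-on-write, so the actual copy is deferred until one of the copies is mutated)!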

var a = [0, 2, 5, 10]   // array is a struct
var b = a       // copied
b.append(11)    // "a" still has 4 elements

class Foo {
    var inner = [0, 2, 5, 10]
}

let f = Foo()
let g = f               // "f" and "g" hold reference to same class
g.inner.append(15)      // "f.inner" and "g.inner" are same (5 elements).
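To put the two behaviors in Rust terms, here's a small sketch (not a one-to-one translation): the copy has to be spelled out with clone(), and sharing has to be spelled out with a reference-counted pointer. The function names are made up for this example.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Value semantics: `b` is an explicit, independent copy of `a`.
fn value_copy() -> (usize, usize) {
    let a = vec![0, 2, 5, 10];
    let mut b = a.clone(); // without `.clone()`, `a` would be moved into `b`
    b.push(11);
    (a.len(), b.len())
}

struct Foo {
    inner: Vec<i32>,
}

/// Reference semantics: `f` and `g` point to the same `Foo`.
fn shared_reference() -> (usize, usize) {
    let f = Rc::new(RefCell::new(Foo { inner: vec![0, 2, 5, 10] }));
    let g = Rc::clone(&f); // bumps the reference count, no deep copy
    g.borrow_mut().inner.push(15);
    let lengths = (f.borrow().inner.len(), g.borrow().inner.len());
    lengths
}
```

Here value_copy() returns (4, 5) and shared_reference() returns (5, 5), matching the comments in the Swift snippet.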

Protocols

Traits are one of those lovely things in Rust. With generics, they’re simply beautiful. Having gotten used to them, it wasn’t hard for me to get into protocols (interfaces of Swift). All types in Swift can implement protocols.⁷

That said, Swift has its limitations when it comes to protocols. Generics and traits in Rust are rather robust. For example, we can do this in Rust but not in Swift:

impl<T> MyTrait for T where T: MyOtherTrait {
    // MyTrait impl
}

This translates to, “Implement MyTrait for all types that implement MyOtherTrait”. This has some wonderful effects. From and Into traits are my favorites. If your type implements From, then (because of this feature) it gets the Into implementation for free!
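Here's a small self-contained sketch of both points. The method names (name, greet) and the Customer, Feet and Meters types are hypothetical, picked just for illustration.

```rust
trait MyOtherTrait {
    fn name(&self) -> String;
}

trait MyTrait {
    fn greet(&self) -> String;
}

// Blanket impl: every type implementing `MyOtherTrait` gets `MyTrait`.
impl<T> MyTrait for T
where
    T: MyOtherTrait,
{
    fn greet(&self) -> String {
        format!("Hello, {}!", self.name())
    }
}

struct Customer;

impl MyOtherTrait for Customer {
    fn name(&self) -> String {
        "customer".to_string()
    }
}

struct Feet(f64);
struct Meters(f64);

// We only write `From`...
impl From<Feet> for Meters {
    fn from(feet: Feet) -> Self {
        Meters(feet.0 * 0.3048)
    }
}

fn feet_to_meters(feet: f64) -> f64 {
    // ...and `into()` works thanks to the stdlib's blanket impl of `Into`.
    let meters: Meters = Feet(feet).into();
    meters.0
}
```

Customer never mentions MyTrait, yet Customer.greet() compiles and returns "Hello, customer!".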

In Swift, you can add a protocol extension with such a constraint.

extension MyProtocol where Self: MyOtherProtocol {
    // MyProtocol impl
}

But, this doesn’t automatically apply the implementation for all MyOtherProtocol implementors. You still need to extend your types specifically and mark them like:

extension MyStruct: MyProtocol {}

This hasn’t become a big deal for me yet, just saying.

Otherwise, protocols are quite cool. There’s a protocol for hashing, equality and iteration (just like Rust), and there are others like one for types that could be raw values, encoding and decoding.

Then, there’s Codable which unifies serialization and deserialization (again, built into Foundation). The problem is that it’s not even close to serde, which is the commonly used encoding / decoding lib in Rust.

In serde, you can do almost anything with a bunch of attributes (you rarely need to write custom code), which is a great perk for using a statically typed language. In Swift, be it skipping a property, managing a particular property on your own or performing additional validation, anything that deviates even a little requires custom code. It’s not hard to write, but it’s difficult to maintain - whenever you alter the structs, you need to modify that custom implementation. It’d be nice if it could be done with less effort.

If there’s one thing I like about Swift protocols, it’s automatic box’ing (another perk of managed languages). In Rust, you need to specify the pointer which holds a particular trait object. I don’t want this to change. It’s always been and should always be that way in Rust (I need to know whether I’m using Box, Arc or a simple reference), it’s just that it’s sometimes annoying (depending on the use case) to box stuff on our own when you’re dealing with trait objects.
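To make the "specify the pointer" part concrete, here's a minimal sketch of the same trait object behind three different pointers. Shape and Square are hypothetical names used only for this example.

```rust
use std::rc::Rc;

trait Shape {
    fn area(&self) -> f64;
}

struct Square(f64);

impl Shape for Square {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

/// Borrowed trait object: no allocation, the caller keeps ownership.
fn area_of(shape: &dyn Shape) -> f64 {
    shape.area()
}

/// Owned trait objects: each value has been boxed on the heap explicitly.
fn total_area(shapes: &[Box<dyn Shape>]) -> f64 {
    shapes.iter().map(|shape| shape.area()).sum()
}

/// Shared trait object: reference-counted, which is roughly what a
/// protocol-typed class instance gives you in Swift without asking.
fn shared_area(shape: Rc<dyn Shape>) -> f64 {
    shape.area()
}
```

In each signature the pointer type is spelled out, so the cost (borrow, heap allocation or reference count) is visible at a glance.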

Packaging

Swift Package Manager reminded me of build scripts in Rust, because you write in Swift to build your Swift package, and I like it. But, there’s no central registry that SPM relies on (like crates.io for cargo). Instead, it needs Git. In order to specify a package as a dependency, you have to specify the URL of a git repo, and versions are based on tags. That said, you can specify a branch / revision in that repo, or simply use your local path - everything works, so I haven’t had any trouble with it.

I also liked SPM’s model - a package has a name, a number of products (libraries and executables), dependencies and targets. Products depend on targets. Tests are part of targets (called test targets). This means, a package can output any number of executables and libraries, and tests can be located anywhere (typically they’re inside Tests/ in project root). In Rust, we can use workspaces to output multiple products, but unit tests cannot exist elsewhere.⁸

The Future

I think Swift and Rust have a number of similarities in their features (other than the obvious differences).⁹ I have a blind wish that it gets procedural macros from Rust at some point!

Anyway, it didn’t require much effort for someone coming from Rust to get into Swift (but I guess that’s the case for jumping from Rust into any other language, because well… we’ve learned from the master!). I like the way things are in Swift right now, and I’m looking forward to where it’s headed.

If I were to write a web service today, then Swift would be my choice without any second thoughts.

I may have left out some things along the way, but I’ll update the post whenever something comes to mind.

¹ I don’t think I’ll be able to write about my WebID-TLS work just yet. I’ve been diagnosed with CTS lately, and it’s ascending, so I’ll probably be taking a break from my computer starting this February or something and I need to wrap up some work before that.
² Not as much as I love Rust though!
³ I uhh… personally hate try-catching - I like Rust’s way of dealing with fallible types using the Result enum, but again that’s just my preference.
⁴ In Rust, you can only extend your own type or implement your own trait to other types.
⁵ Rust 1.18 introduced support for even fine-grained access control like enabling access in a particular crate, in a module or even a specified path, etc. (besides public and private)
⁶ We don’t have any of this in Rust, although sometimes I wish we had argument labeling, differentiating functions based on signatures (not names), and some core crates getting stabilized.
⁷ Although, protocols marked with class can only be implemented by classes.
⁸ They need to exist in the same module. They can stay outside, but they won’t have access to any internally used types, whereas in Swift, you can mark packages as @testable in imports just for testing.
⁹ Speaking of the future, Swift has NIO for event loops and futures (again, somewhat similar to Rust).

The Swiss Army Knife of Hashmaps

2018-12-07T22:28:46+05:30

A while back, there was a discussion comparing the performance of using the hashbrown crate (based on Google’s SwissTable implementation¹) in the Rust compiler. In the last RustFest, Amanieu was experimenting on integrating his crate into stdlib, which turned out to have some really promising results. As a result, there are plans to move the crate into stdlib.

While the integration is still ongoing, there’s currently no blog post out there explaining SwissTable. So, I thought I’d dig deeper into the Rust implementation to try and explain how its (almost) identical twin hashbrown::HashMap works.

Hashing and Hashmaps

In order to establish some terminology, I’m gonna start from scratch. If you know about hashing, hashmaps, open-addressing and cache performance in general, feel free to skip this section.

While arrays and linked lists hold a sequence of items, hashmaps (or tables) hold key/value pairs i.e., they bind values to keys. So, you insert a value for a key and you can later address those values (fetch/remove/whatever) using the same key.

Hashing is basically computing a special number (the hash) for some hashable object, provided that if two objects are equal, their hashes must be equal. When a hash function produces the same hash for two different objects, we call it a hash collision. A perfect hash function doesn’t result in collisions, but since we’re not in an ideal world, collisions happen (the rate depends on the algorithm).

Hashmaps use arrays as their backend. In the context of hashmaps, the array elements are called buckets or slots (I’ll be using the two terms interchangeably). Let’s take the following array:

 index |  0  |  1  |  2  |  3  |
-------|-----|-----|-----|-----|
 value |     |     |     |     |

The value here is the array’s element in that index (usually, it’s the key/value pair as a whole!).

This array has 4 slots (it could be just one!). We wish to insert a key/value pair (4, 8), for which we use a hash function H(x). In order to find the index at which the key/value pair should be inserted, the key is hashed using H(x), and the index is obtained by taking the hash modulo the length of the array.

H(4) = 12638164110811449565
i = H(4) % 4 = 3

For the sake of keeping this post simple, I’ve used an unspecified hash function² (so don’t worry about it) - only the actual hashes matter as far as we’re concerned. And, note that the hash is 64 bits - we’ll be sticking with this size throughout the post. Again for simplicity, we’re only focusing on 64-bit machines.
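The two steps (hash the key, then reduce the hash to an index) can be sketched in a few lines of Rust. Note that DefaultHasher is only a stand-in here: it's not the hash function used for the numbers in this post, so its hashes (and hence its indices) will differ from the ones shown.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hashes `key` to a 64-bit value. `DefaultHasher` is an assumption made
/// for this sketch - the post's H(x) is a different, unspecified function.
fn hash_of<K: Hash>(key: &K) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Maps a hash to a bucket index by taking it modulo the bucket count.
fn bucket_index(hash: u64, buckets: usize) -> usize {
    (hash % buckets as u64) as usize
}
```

Whatever the hash is, bucket_index(hash_of(&key), 4) always lands in 0..4, and the same key always lands on the same index.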

Anyway, now that we’ve found the index of the bucket, let’s insert the key/value pair. Our array now looks like this:

 index |  0  |  1  |  2  |  3  |
-------|-----|-----|-----|-----|
 value |     |     |     | 4,8 |

But, why did we have to add both the key and value (4, 8), instead of just the value 8?

Dealing with collisions

Right now, for performing an operation on a key/value pair in the map, we find it by following the same steps - hash the key, modulo N, land on the index and perform the desired action.

Hash functions, even with a whole 8 bytes of output, run into hash collisions. On top of that, by doing the modulo operation, we’ve greatly reduced their range, so different hashes can land on the same index. Let’s try inserting (8, 2) into our map:

H(8) = 12638161911788193143
i = H(8) % 4 = 3

But, we already have an element at index 3!

In order to deal with these collisions, we group similar data together. One way is to assign a linked list to each bucket. Now, our map will look like:

 index |  0  |  1  |  2  |  3  |
-------|-----|-----|-----|-----|
 value |     |     |     | |o| |
                          __|__
                         | 4,8 |
                         |-----|
                         | 8,2 |
                         -------

… and when we wish to find an element, we stop at the bucket, traverse through the linked list and compare the keys for locating the value. This method of using another data structure for storing the values in each bucket is called separate chaining. So, if we keep getting collisions for subsequent insertions, our linked list will get bigger and that will impact our performance, right?
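Here's a minimal sketch of separate chaining, under a few assumptions: keys and values are plain u64s, DefaultHasher stands in for H(x), and a Vec stands in for the linked list in each bucket (the idea is the same, a per-bucket collection that we walk while comparing keys).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Separate chaining: every bucket owns a chain of key/value pairs.
/// A `Vec` stands in for the linked list to keep the sketch short.
struct ChainedMap {
    buckets: Vec<Vec<(u64, u64)>>,
}

impl ChainedMap {
    fn new(bucket_count: usize) -> Self {
        ChainedMap { buckets: vec![Vec::new(); bucket_count] }
    }

    /// Hash the key, then take it modulo the number of buckets.
    fn index_of(&self, key: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() % self.buckets.len() as u64) as usize
    }

    /// Inserts the pair, returning the previous value for `key`, if any.
    fn insert(&mut self, key: u64, value: u64) -> Option<u64> {
        let index = self.index_of(key);
        let chain = &mut self.buckets[index];
        // The stored key is what lets us tell colliding pairs apart.
        for entry in chain.iter_mut() {
            if entry.0 == key {
                return Some(std::mem::replace(&mut entry.1, value));
            }
        }
        chain.push((key, value));
        None
    }

    /// Walks the chain of the key's bucket, comparing keys.
    fn get(&self, key: u64) -> Option<u64> {
        self.buckets[self.index_of(key)]
            .iter()
            .find(|entry| entry.0 == key)
            .map(|entry| entry.1)
    }
}
```

Creating the map with a single bucket (ChainedMap::new(1)) forces every key to collide, which is a quick way to see that lookups still work as long as the keys are stored alongside the values.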

Not exactly. This is where we talk about the load factor. Just like all other dynamic data structures, hashmaps should be able to resize at will! Load factor is the ratio of the number of elements in the hashmap to the number of buckets. Once we reach a certain load factor (say, 50%, 70% or 90% depending on your configuration), hashmaps will resize and rehash all the key/value pairs.

Let’s say that our hashmap doubles in size at a load factor of 60%. This means, once we add a third element (5, 7), our new hashmap will resize (regardless of whether it’s colliding). It will now look something like:

 index |  0  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |
-------|-----|-----|-----|-----|-----|-----|-----|-----|
 value | |o| |     |     | |o| |     |     |     | |o| |
        __|__             __|__                   __|__
       | 5,7 |           | 4,8 |                 | 8,2 |
       -------           -------                 -------

The choice of the load factor depends entirely on the internals of your map i.e., how it hashes and determines buckets. For example, if you have a weaker hash function which results in adding a number of elements to the same bucket, then in order to reduce traversal, you’d probably go for a resize / rehash when you hit a relatively lower load factor (the choice is based on a gazillion performance tests!).
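The resize decision from the example above (4 buckets, doubling at a load factor of 60%) boils down to a little arithmetic. A sketch, with made-up function names:

```rust
/// Load factor = number of stored elements / number of buckets.
fn load_factor(len: usize, buckets: usize) -> f64 {
    len as f64 / buckets as f64
}

/// Returns the bucket count to use for inserting one more element into a
/// map that currently holds `len` elements, doubling whenever the new
/// load factor would exceed `max_load`.
fn buckets_after_insert(len: usize, buckets: usize, max_load: f64) -> usize {
    if load_factor(len + 1, buckets) > max_load {
        buckets * 2
    } else {
        buckets
    }
}
```

With 4 buckets and a limit of 0.6, the second element is fine (2 / 4 = 0.5), but the third one pushes the load factor to 0.75, so the map grows to 8 buckets and ends up at 3 / 8 = 0.375.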

However, we have a major performance bottleneck. Firstly, the usage of external data structures requires additional allocations / pointers which themselves consume some memory per element. And second, when it comes to linked lists, they have worse processor cache performance.

Okay, what does that mean?

CPU Cache

Our RAM is faster than say, our SSD (like, a hundred times!), but it’s also a hundred times slower than the CPU³. That’s why CPUs have their own caches (L1, L2, L3, etc.). Data flows from memory to the CPU cache in fixed sized blocks called cache lines. This fetch can take up to 100 ns (~200-300 clock cycles). In contrast, L1 cache reference is ~1-2 ns, whereas for L2, it’s ~8-12 ns (CPU caches are hierarchical, think of L1 as close to the CPU and smaller in size compared to L2).

What this means is that, whenever the CPU needs to read/write to a memory location, it checks the cache(s), and if it’s present, it’s a cache hit, otherwise it’s a cache miss. Whenever a cache miss occurs, we pay the cost of fetching the data from main memory (thereby losing a few hundred CPU cycles by waiting).

Coming back to linked lists, the pointers of subsequent nodes could be anywhere, which results in fetching cache lines randomly. This indirection leads to the poor cache performance of linked lists.

The other way

The second way (for our hashmap) is to get rid of using external data structures completely, and use the same array for storing values alongside buckets. With another element (12,9), our map will look like this:

 index |  0  |  1  |  2  |  3  |  4   |  5  |  6  |  7  |
-------|-----|-----|-----|-----|------|-----|-----|-----|
 value | 5,7 |     |     | 4,8 | 12,9 |     |     | 8,2 |
-------|-----|-----|-----|-----|------|-----|-----|-----|
 slot  |  0  |     |     |  3  |  3   |     |     |  7  |

I’ve added another row to indicate the buckets/slots computed from the hashes.

Note that the new key/value pair is on index 4, even though its slot index is 3. This way, we sequentially fill the empty slots (this is linear probing, the simplest form of open addressing). Also, the ordering is unnecessary.

Let’s try and add a few more elements…

 index |  0  |  1   |  2  |  3  |  4   |  5  |  6  |  7  |
-------|-----|------|-----|-----|------|-----|-----|-----|
 value | 5,7 | 29,7 | 6,6 | 4,8 | 12,9 | 3,4 |     | 8,2 |
-------|-----|------|-----|-----|------|-----|-----|-----|
 slot  |  0  |  0   |  1  |  3  |  3   |  2  |     |  7  |

As you can see, (3, 4) has a slot index of 2, but since that index already has an element, we’re inserting it into the first empty element we find (which is at 5). During fetching, we land on a slot (based on the hash), compare the keys one by one, and traverse all the way until we either find a matching pair or an empty slot (probing).
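Insertion and lookup with linear probing can be sketched as follows. To reproduce the table above, this sketch takes the slot index ("home") from the caller instead of computing it from a hash - that's a simplification for illustration, as are all the names used here.

```rust
/// One stored pair, plus the slot its hash maps to (the "slot" row above).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Entry {
    key: u64,
    value: u64,
    home: usize,
}

struct ProbingMap {
    slots: Vec<Option<Entry>>,
}

impl ProbingMap {
    fn new(capacity: usize) -> Self {
        ProbingMap { slots: vec![None; capacity] }
    }

    /// Inserts at the first free slot at or after `home`, wrapping around.
    /// Returns the index used, or `None` if the table is full.
    /// Assumes `key` isn't already present (no update handling).
    fn insert(&mut self, key: u64, value: u64, home: usize) -> Option<usize> {
        let len = self.slots.len();
        for step in 0..len {
            let index = (home + step) % len;
            if self.slots[index].is_none() {
                self.slots[index] = Some(Entry { key, value, home });
                return Some(index);
            }
        }
        None
    }

    /// Probes from `home` until the key matches or an empty slot is hit.
    fn get(&self, key: u64, home: usize) -> Option<u64> {
        let len = self.slots.len();
        for step in 0..len {
            match self.slots[(home + step) % len] {
                Some(entry) if entry.key == key => return Some(entry.value),
                Some(_) => continue,
                None => return None,
            }
        }
        None
    }
}

/// Rebuilds the 8-slot table above, using the post's slot indices.
fn example_table() -> ProbingMap {
    let mut map = ProbingMap::new(8);
    let pairs = [(5, 7, 0), (4, 8, 3), (8, 2, 7), (12, 9, 3), (29, 7, 0), (6, 6, 1), (3, 4, 2)];
    for (key, value, home) in pairs {
        map.insert(key, value, home);
    }
    map
}
```

Inserting the pairs in that order leaves (3, 4) at index 5, and get(3, 2) has to walk over indices 2, 3 and 4 before finding it.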

Here’s the catch. When you remove an element, you can’t simply remove the key/value pair from a slot and be done with it. If you do that, then you’ve created an empty slot, which could result in the map reporting existing keys as missing (because the search stops as soon as it encounters an empty slot). So, you have two choices:

In addition to removing a pair, we also move the next pair to that slot (shift backwards).⁴ For example, if we removed (4, 8) in the above map, then we move (3, 4) to its position (which has the lowest slot index), and the table will now look like:

 index |  0  |  1   |  2  |  3  |  4   |  5  |  6  |  7  |
-------|-----|------|-----|-----|------|-----|-----|-----|
 value | 5,7 | 29,7 | 6,6 | 3,4 | 12,9 |     |     | 8,2 |
-------|-----|------|-----|-----|------|-----|-----|-----|
 slot  |  0  |  0   |  1  |  2  |  3   |     |     |  7  |

Or, you add a special flag (tombstone) to the removed slots, and when you probe, you skip the slots containing that flag. But, this will also affect the load factor of the hashmap - removing a number of elements in a map followed by inserting a few could trigger a rehash.
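The tombstone approach can be sketched like this. The Slot enum and the function names are hypothetical, and the slot index ("home") is again passed in by the caller. The example uses (4, 8) and (12, 9), both of which have a slot index of 3 in the earlier tables.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Slot {
    Empty,
    Deleted, // tombstone
    Full { key: u64, value: u64 },
}

/// Probes linearly from `home`. Tombstones are skipped - only an `Empty`
/// slot ends the search.
fn find(slots: &[Slot], key: u64, home: usize) -> Option<usize> {
    let len = slots.len();
    for step in 0..len {
        let index = (home + step) % len;
        match slots[index] {
            Slot::Full { key: k, .. } if k == key => return Some(index),
            Slot::Empty => return None,
            _ => {} // tombstone or a different key: keep probing
        }
    }
    None
}

/// Replaces the pair with a tombstone instead of emptying the slot.
fn remove(slots: &mut [Slot], key: u64, home: usize) -> Option<u64> {
    let index = find(slots, key, home)?;
    let current = slots[index];
    if let Slot::Full { value, .. } = current {
        slots[index] = Slot::Deleted;
        return Some(value);
    }
    None
}

/// Only the part of the table that matters here: two pairs whose
/// slot index is 3, stored at indices 3 and 4.
fn example_slots() -> Vec<Slot> {
    let mut slots = vec![Slot::Empty; 8];
    slots[3] = Slot::Full { key: 4, value: 8 };
    slots[4] = Slot::Full { key: 12, value: 9 };
    slots
}
```

After removing key 4, index 3 holds a tombstone, and a search for key 12 still walks past it and finds the pair at index 4. Had the slot been set to Empty instead, that search would have stopped right there.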

There are other ways to resolve hash collisions in open addressing (quadratic probing, where we quadratically address the elements instead of sequentially, and double hashing where we use another hash function to land on the slot), but they’re out of scope of this post.

The advantage of linear probing is that it has the best cache performance - if you think about it, linear probing sequentially visits the elements, which means, (most often, depending on the size of the object) the data is part of a cache line, which is great, because we don’t waste CPU cycles (in contrast, quadratic probing doesn’t offer such cache performance and double hashing uses another hash function, which by definition means that we’re spending more work on computing another hash).

Although, we do have a problem with openly addressed maps (linear probing in particular) - clustering. Whenever a collision occurs, we start queueing the key/value pairs. For example, in the above table, because (29, 7) has the same slot as (5, 7), it’s at index 1, and we had to place (6, 6) at 2 (even though its slot index is 1), and other elements follow.

This way, as the table grows, more elements get pushed away from their actual index (which elongates the search/probe sequence, thereby increasing the cache misses). It also depends on the hash function, in that, a weaker hash function can lead to more collisions, and hence, more clusters.

Robin Hood hashing

I’m also going through Robin Hood hashing, because it’s a nice optimization and at the time of writing of this post, Rust used this in std::collections::HashMap, but again if you know this, feel free to skip this section. It has nothing to do with hashbrown itself.

This is one of the few methods to improve the overall efficiency of openly addressed hashmaps. It offers a way to redistribute an existing key/value pair during the insertion of a new pair. For example, let’s take this openly addressed map:

 index |  0  |  1   |  2  |  3  |  4   |  5  |  6  |  7  |
-------|-----|------|-----|-----|------|-----|-----|-----|
 value | 5,7 | 29,7 | 6,6 | 3,4 | 12,9 |     |     | 8,2 |
-------|-----|------|-----|-----|------|-----|-----|-----|
 slot  |  0  |  0   |  1  |  2  |  3   |     |     |  7  |

We’re interested in a variable called distance to the actual slot - I’ll call it D(value). For example, the actual slot of (5, 7) is 0, and it’s located at 0, so the distance is 0. For (29, 7) on the other hand, the distance is 1, because it’s at 1, even though it should’ve been at 0.

In robin hood hashing, you follow one rule - if the distance to the actual slot of the current element in the slot is less than the distance to the actual slot of the element to be inserted, then we swap both the elements and proceed.

Okay, that was ugly. An example will really help us.

We’d like to insert (13, 3) into this map. The slot index for 13 is 0. So, let’s start at 0. We already have (5, 7) located there, which also has a slot index 0.

 index |  0  |  1   |  2  |  3  |  4   |  5  |  6  |  7  |
-------|-----|------|-----|-----|------|-----|-----|-----|
 value | 5,7 | 29,7 | 6,6 | 3,4 | 12,9 |     |     | 8,2 |
-------|-----|------|-----|-----|------|-----|-----|-----|
 slot  |  0  |  0   |  1  |  2  |  3   |     |     |  7  |
          ^
          | D(13) == 0
        (13,3)

Both have the same distances i.e., D(5) == D(13) == 0. We move forward.

 index |  0  |  1   |  2  |  3  |  4   |  5  |  6  |  7  |
-------|-----|------|-----|-----|------|-----|-----|-----|
 value | 5,7 | 29,7 | 6,6 | 3,4 | 12,9 |     |     | 8,2 |
-------|-----|------|-----|-----|------|-----|-----|-----|
 slot  |  0  |  0   |  1  |  2  |  3   |     |     |  7  |
                ^
                | D(13) == 1
              (13,3)

Still the same. D(29) == D(13) == 1 (both are 1 slot away from their actual slots). Moving on…

 index |  0  |  1   |  2  |  3  |  4   |  5  |  6  |  7  |
-------|-----|------|-----|-----|------|-----|-----|-----|
 value | 5,7 | 29,7 | 6,6 | 3,4 | 12,9 |     |     | 8,2 |
-------|-----|------|-----|-----|------|-----|-----|-----|
 slot  |  0  |  0   |  1  |  2  |  3   |     |     |  7  |
                       ^
                       | D(13) == 2
                     (13,3)

In the next stop, we see something. D(13) == 2 whereas D(6) == 1 and hence D(6) < D(13), which means we should swap the pairs and proceed.

 index |  0  |  1   |  2   |  3  |  4   |  5  |  6  |  7  |
-------|-----|------|------|-----|------|-----|-----|-----|
 value | 5,7 | 29,7 | 13,3 | 3,4 | 12,9 |     |     | 8,2 |
-------|-----|------|------|-----|------|-----|-----|-----|
 slot  |  0  |  0   |  0   |  2  |  3   |     |     |  7  |
                       ^
                       | D(6) == 1
                     (6, 6)

Now, we go looking for a place to insert (6, 6). This way, we compare and swap the key/value pairs based on their distances. In the end, we have this nicely distributed hashmap:

 index |  0  |  1   |  2   |  3  |  4  |  5   |  6  |  7  |
-------|-----|------|------|-----|-----|------|-----|-----|
 value | 5,7 | 29,7 | 13,3 | 6,6 | 3,4 | 12,9 |     | 8,2 |
-------|-----|------|------|-----|-----|------|-----|-----|
 slot  |  0  |  0   |  0   |  1  |  2  |  3   |     |  7  |

Since the hashmap has reached almost its capacity, you might be wondering whether this would’ve already triggered a rehash. But, the resize/rehash depends on the load factor, and because this method redistributes the key/value pairs regardless of when they get inserted (takes away from the rich and gives it to the poor, hence the name), the hashmaps could now have higher load factors of even 90-95%.

This also brings another improvement to searching. We don’t have to probe all the way until we find an empty slot. We can stop when D(slot value) < D(query value), since our rule guarantees that this shouldn’t happen for the key we’re looking for. For example, in the above table, if we wanna query for the key 21 (whose slot index is 0), then we can stop at index 3, because at that point D(6) == 2 which is less than D(21) == 3 which wouldn’t have happened if the key/value pair were there. So, we can safely declare that the key doesn’t exist.
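Both the insertion rule and the early exit during lookup can be sketched in Rust. As before, this is only an illustration: the slot index ("home") is supplied by the caller so that the tables above can be reproduced, the names are made up, and the insertion assumes a new key and at least one empty slot.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Entry {
    key: u64,
    value: u64,
    home: usize, // slot index computed from the hash
}

/// D(x): how far `index` is from the home slot, with wrap-around.
fn distance(home: usize, index: usize, len: usize) -> usize {
    (index + len - home) % len
}

/// Robin Hood insertion. Assumes a new key and at least one empty slot.
fn insert(slots: &mut [Option<Entry>], mut incoming: Entry) {
    let len = slots.len();
    let mut index = incoming.home;
    loop {
        let current = slots[index];
        match current {
            None => {
                slots[index] = Some(incoming);
                return;
            }
            Some(resident) => {
                // The resident is "richer" (closer to its home): swap, and
                // carry on looking for a place for the evicted pair.
                if distance(resident.home, index, len) < distance(incoming.home, index, len) {
                    slots[index] = Some(incoming);
                    incoming = resident;
                }
            }
        }
        index = (index + 1) % len;
    }
}

/// Lookup with early exit. Returns the value (if any) and the number of
/// slots that were probed.
fn get(slots: &[Option<Entry>], key: u64, home: usize) -> (Option<u64>, usize) {
    let len = slots.len();
    for step in 0..len {
        let index = (home + step) % len;
        match slots[index] {
            None => return (None, step + 1),
            Some(e) if e.key == key => return (Some(e.value), step + 1),
            // D(slot value) < D(query value): the key can't be further on.
            Some(e) if distance(e.home, index, len) < step => return (None, step + 1),
            Some(_) => {}
        }
    }
    (None, len)
}

/// Rebuilds the starting table of this section.
fn example_table() -> Vec<Option<Entry>> {
    let mut slots = vec![None; 8];
    let pairs = [(5, 7, 0), (29, 7, 0), (6, 6, 1), (3, 4, 2), (12, 9, 3), (8, 2, 7)];
    for (key, value, home) in pairs {
        insert(&mut slots, Entry { key, value, home });
    }
    slots
}
```

Inserting Entry { key: 13, value: 3, home: 0 } into example_table() produces the final table shown above, and get(&slots, 21, 0) gives up after probing 4 slots (indices 0 through 3) instead of walking all the way to the empty slot at index 6.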

Now that we’ve grazed over a lot of things associated with openly-addressed hashmaps⁵, let’s proceed to hashbrown.

Hashbrown

I’m not gonna call it “SwissTable” from here on because firstly, even though hashbrown was a port of SwissTable, the author has made a few changes to improve its performance, and second, I didn’t read the C++ code at all - I followed the hashbrown crate.

An optimization for linear probing is storing some metadata for each key/value pair. The first thing that comes to mind is the check for equality - once we’ve landed on a slot, we probe by comparing the keys in the slots one by one. While this is cheap for integers, it gets expensive for bigger objects. We could store the hash, but, that’s another 8 bytes per slot, which is a huge deal for memory-eating hashmaps!

Let’s reset our map and make some changes to its behavior.

Let’s set the initial size of the internal array to 16 (I’ll get into why we’re doing that in a bit, trust me for now). We call this set of 16 slots a group.⁶ So, a map is made of a number of consecutive groups.
For each slot in a group, let’s assign a byte for metadata and call it control byte. Again, we’ll see what it is soon.

It will now look something like this:

 index |   0    |   1    |   2    |   3    |   4    | ... |   15   |
-------|--------|--------|--------|--------|--------|     |--------|
 value |        |        |        |        |        | ... |        |
-------|--------|--------|--------|--------|--------|     |--------|
 ctrl  |00000000|00000000|00000000|00000000|00000000| ... |00000000|

I’m skipping a number of elements (irrelevant to us) so that it fits on the screen. I’m also representing the “control byte” in bits, because we’ll be playing with bits. As a result, the control bytes of this group add up to 128 bits - a nice round number (we’ll see why).

Our first candidate for insertion is (5, 7).

H(5) = 12638147618137026400

Taking the top (most significant) 7 bits of this hash⁷ and calling it H2(x), we get:

H2(5) = H(5) >> 57 = 87 = 0b1010111

These will be the bottom 7 bits of our control byte.⁸ Then, we use a special bit for our own purposes (to indicate whether the slot value is empty, full or deleted) and this will be the top bit. The states are now represented as follows:⁹

0b11111111  // EMPTY (all bits are set)
0b10000000  // DELETED (top bit is set)
0b0.......  // FULL (whenever the top bit isn't set)

In light of this information, all the slots are empty, so our map will look like:

 index |   0    |   1    |   2    |   3    |   4    | ... |   15   |
-------|--------|--------|--------|--------|--------|     |--------|
 value |        |        |        |        |        | ... |        |
-------|--------|--------|--------|--------|--------|     |--------|
 ctrl  |11111111|11111111|11111111|11111111|11111111| ... |11111111|

To recall what we’ve done so far, we’re storing the top 7 bits of our key’s hash in our control byte, and in addition to that, we use the top bit of the control byte to indicate whether the slot is full, empty or deleted.
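In code, the control byte boils down to a couple of constants and a shift. This is only a sketch in Python with names I made up (h2, is_full); hashbrown itself is Rust and laid out differently.

```python
EMPTY = 0b1111_1111      # all bits set
DELETED = 0b1000_0000    # only the top bit set


def h2(hash64):
    # Top 7 bits of a 64-bit hash. The result fits in 7 bits, so when it's
    # stored as a control byte the top bit stays 0, which means FULL.
    return (hash64 & 0xFFFF_FFFF_FFFF_FFFF) >> 57


def is_full(ctrl):
    return (ctrl & 0x80) == 0


H5 = 12638147618137026400    # H(5) from above
ctrl_for_5 = h2(H5)          # 87, i.e. 0b1010111
```

Since both EMPTY and DELETED have the top bit set, a single bit test is enough to tell "there's a key here" from everything else.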

Going back to our candidate (5, 7), its slot index is 0 i.e., H(5) % 16 == 0.

 index |   0    |   1    |   2    |   3    |   4    | ... |   15   |
-------|--------|--------|--------|--------|--------|     |--------|
 value | (5,7)  |        |        |        |        | ... |        |
-------|--------|--------|--------|--------|--------|     |--------|
 ctrl  |01010111|11111111|11111111|11111111|11111111| ... |11111111|

Once the pair is inserted, we’ve also set its control byte to H2(5), since the top bit is zero anyway (because it’s now FULL). Now, let’s try inserting (39, 8).

H(39) = 17050702200253021726
i = H(39) % 16 = 2
H2(39) = H(39) >> 57 = 54 = 0b110110

And, we do the same thing.

 index |   0    |   1    |   2    |   3    |   4    | ... |   15   |
-------|--------|--------|--------|--------|--------|     |--------|
 value | (5,7)  |        | (39,8) |        |        | ... |        |
-------|--------|--------|--------|--------|--------|     |--------|
 ctrl  |01010111|11111111|00110110|11111111|11111111| ... |11111111|
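The two insertions can be replayed with a tiny Python sketch of a single group. The Group class and its methods are hypothetical names, the slot indices and 7-bit values are the ones from the tables above, and probing simply wraps around inside the one group (a real map would move on to the next group).

```python
EMPTY = 0b1111_1111


class Group:
    WIDTH = 16

    def __init__(self):
        self.ctrl = [EMPTY] * self.WIDTH     # one control byte per slot
        self.slots = [None] * self.WIDTH     # the key/value pairs

    def insert(self, index, h2, key, value):
        # Probe linearly from `index` for the first slot whose top bit is set
        # (EMPTY or DELETED) and claim it.
        for step in range(self.WIDTH):
            i = (index + step) % self.WIDTH
            if self.ctrl[i] & 0x80:
                self.ctrl[i] = h2            # top bit is 0, so the slot is now FULL
                self.slots[i] = (key, value)
                return i
        raise RuntimeError("group is full")

    def ctrl_bits(self):
        return [format(c, "08b") for c in self.ctrl]


group = Group()
group.insert(0, 0b1010111, 5, 7)     # (5, 7) lands in slot 0
group.insert(2, 0b0110110, 39, 8)    # (39, 8) lands in slot 2
```

Calling group.ctrl_bits() gives back the same ctrl row as the table: 01010111, 11111111, 00110110, and then thirteen more 11111111.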

Now, we’re all set. Let’s start addressing the reasons behind whatever we’ve done.

Why the round number?

Each group contains 16 slots, and so 16 control bytes (128 bits in total). The first natural question is why we’ve restricted the group size to 128 bits.

When we query for a key in the map, we first use the hash to land on the group corresponding to a slot, find the offset of the slot inside that group and start probing by comparing against the 7-bit hashes in the control byte.

The boost here is that the control bytes (being 128 bits) can fit into an L1 cache line. This means, we can probe an entire group really quickly, before having to fetch another group from L2 or L3 or whatever. And, we don’t have to worry about comparing the keys at all, until we encounter all 7 bits matching in a control byte.

There’s one other cool optimization for modern processors. Modern CPUs support SIMD instructions, which basically means that we can do some operation (add or multiply or compare, etc.) on multiple values at the same time in a processor! SSE is one family of such instructions, working on 128-bit registers which can hold two 64-bit floats, four 32-bit integers or sixteen 8-bit integers.

Now, our workflow will simply be:

Load up these bytes from an array.
Set the byte to compare.
Compare both the values.
Mask the values from comparison to true or false.

And, that’s finding the results from 16 slots in four CPU instructions!

In the above example, if we wish to find 39, then all we have to do is find the position of the group using the hash, find the 7-bit value 0b110110 from the hash, and do this:

1. Load group (A)

---------------------------------------------     ------------
| 01010111 | 11111111 | 00110110 | 11111111 | ... | 11111111 |
---------------------------------------------     ------------

2. Set comparable 0b110110 (B)

---------------------------------------------     ------------
| 00110110 | 00110110 | 00110110 | 00110110 | ... | 00110110 |
---------------------------------------------     ------------

3. Compare A and B

---------------------------------------------     ------------
| 00000000 | 00000000 | 11111111 | 00000000 | ... | 00000000 |
---------------------------------------------     ------------
                       (success!)
4. Mask values

---------------------------------------------     ------------
|    0     |    0     |    1     |    0     | ... |    0     |
---------------------------------------------     ------------
                         (true)

After masking, we actually get an integer, because the final result for each slot is either 0 or 1, and all 16 of them can be packed into the bits of a single integer. In other words, the position of each set bit in the returned integer corresponds to a matching slot in the group.

So, we have the results! Now, all we need to do is check the indices of bits that are set in the final integer, compare the key(s) in those slots against the querying key (for equality), find the corresponding value, and we’ll land on (39, 8).
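Python has no SIMD, but the logic of the four steps can be emulated with a plain loop. In this sketch, match_byte, set_bits and find are names I picked; on x86 the loop inside match_byte is what a byte-wise compare followed by a movemask would do in one go.

```python
def match_byte(ctrl, h2):
    # Emulates "compare + mask": bit i of the result is set when ctrl[i] == h2.
    mask = 0
    for i, byte in enumerate(ctrl):
        if byte == h2:
            mask |= 1 << i
    return mask


def set_bits(mask):
    # Yield the indices of the set bits, lowest first.
    i = 0
    while mask:
        if mask & 1:
            yield i
        mask >>= 1
        i += 1


def find(ctrl, slots, h2, key):
    for i in set_bits(match_byte(ctrl, h2)):
        if slots[i][0] == key:       # full key comparison only for candidates
            return slots[i][1]
    return None


ctrl = [0b01010111, 0xFF, 0b00110110] + [0xFF] * 13
slots = [(5, 7), None, (39, 8)] + [None] * 13
mask = match_byte(ctrl, 0b0110110)   # only bit 2 is set
```

EMPTY and DELETED bytes can never show up as false candidates, because their top bit is set and a 7-bit hash never has it.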

Hints to compiler!

Further optimizations can be done on this implementation. If we’ve used a good hash function that distributes the bits reasonably well, then we can hint to the compiler that the final equality check (for the key) will almost always be true. In Rust, we have likely and unlikely to achieve this. So, we can tell the compiler that the equality is likely to be true.

The next hint is on whether we should probe to the next group looking for that key. Again, if our hash function is good enough, then the odds of that happening are very very low.¹⁰ So, we can hint to the compiler again that it’s likely to stop probing.

When we remove a key/value pair, we can follow the tombstone method - we query the key as usual, find the slot (in some group), set a tombstone (by marking the control byte as DELETED) and later mark it back as EMPTY (when we resize, for example). But, we could take advantage of the previous fact and say that if the group had at least one empty element, then we don’t have to add a tombstone. We could simply set it to EMPTY, because the probing is gonna stop with this group anyway (because the probing would’ve encountered an EMPTY).
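Here's that removal rule as a sketch in Python, simplified to a single group (the real check in hashbrown has to care about probes that start in neighbouring groups, which I'm ignoring here). The erase function is a hypothetical name.

```python
EMPTY = 0b1111_1111
DELETED = 0b1000_0000


def erase(ctrl, slots, index):
    # If the group still has an EMPTY byte, any probe would have stopped in
    # this group anyway, so the slot can go straight back to EMPTY.
    # Otherwise we have to leave a tombstone so that probing continues.
    slots[index] = None
    ctrl[index] = EMPTY if EMPTY in ctrl else DELETED
    return ctrl[index]


roomy_ctrl = [0b01010111, EMPTY, 0b00110110] + [EMPTY] * 13
roomy_slots = [(5, 7), None, (39, 8)] + [None] * 13

packed_ctrl = [0b0000001] * 16               # a completely full group
packed_slots = [(k, k) for k in range(16)]
```

Erasing slot 2 of the roomy group yields EMPTY, whereas erasing any slot of the packed group yields DELETED.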

Amanieu has added more stuff like making the hashmap efficient for 32-bit platforms, supporting maps smaller than a group, and a ton of other features to make it compatible with std, which is pretty cool!

I hope you found this post interesting. As always, feel free to drop any comment if you have anything to add.

A huge thanks to Ana and Amanieu for reviewing this post!

I insist on watching this talk when you have some free time! ↩
Although, hashbrown uses FX hash. ↩
What Every Programmer Should Know About Memory. ↩
This is what Rust used in its Robin hood hashing implementation. ↩
I haven’t talked about a number of improvements that could be made to Robin Hood hashing - backward shifting entries during deletion (instead of tombstones), caching hashes, slot indices or “distances” to improve probing, etc. all improve the performance of the hashmap. ↩
This doesn’t mean that a hashmap that contains, say 2 elements should have a 16 element array (for the values) - only the group has 16 elements, the actual array containing the key/value pairs will still be 2 elements. ↩
In the SwissTable implementation, the bottom (least significant) 7 bits were used for H2, but Amanieu has claimed that his choice led to slightly more efficient code. ↩
Again, this is how hashbrown is implemented. SwissTable used the top 7 bits. ↩
One more Rust-specific enhancement - SwissTable had kSentinel to deal with C++ iterators, which wasn’t required in Rust. ↩
According to the talk, it’d take a load factor of 94% for an ideal map to reach that situation. ↩

1 year…

2018-11-20T15:40:26+05:30

It’s been a year since I’ve blogged, despite people reminding me that blogging is a good thing and that it helps in the longer run. I had at least 3 posts in mind over the past few months, but I kept procrastinating on them. I now realize that if I keep doing that, then I might just as well stop blogging.

So, I’ve planned to blog about two topics by the end of this year:

Experience report on using Swift for a year from a Rustacean’s perspective (what I liked, what I feared, what I hated, etc.).
Building an HTTP (+ TLS) server from scratch using NIO (Swift NIO is a low-level framework for building network applications, which has built-in support for event loops, futures, HTTP/2, etc.).

Firstly, here’s the story of my 2018 projects…

Building services for a client

Although my job began with the promise of working on Rust (FFI-bridging Rust and Swift), we had to move to Swift in a few days, because it felt easy and productive to build web services in a managed language like Swift compared to Rust (that said, we do use Rust for other things - our infrastructure, for example).

Our studio had one goal - build open-source, decentralized web and e-commerce platforms, which are free, intuitive, and easy-to-use (and are now open for early access!). My colleague (our technical lead and director) was working on the web platform, and me on the e-commerce thingy.

We got a client who wanted to move away from their Wordpress stack and start using our platforms (especially for e-commerce, as they were selling coffee-related products). The problem was that they had no developers, and my colleagues only had time for the ongoing projects in our studio (so they were unavailable to commit full-time).

Our agreement was that I would work full-time (as a full stack developer), our designer would be contracted for a few days every month, our director would be spending 3 days every week for them (code review, meetings, sprint planning, etc.), and that they should hire developers along the way, because it’s simply too much work for one person to accomplish in the given time frame (2 months).

Moreover, we needed to polish our platforms every now and then, because the whole point of joining them was mutual benefit - they get our services, and we use their data to improve our platform (not to mention that we also had to add features specifically for them).

After a while, they ended up extending their milestone (3 more months), but they never hired any developer to support us. After 2 months, I was still lagging behind. I was already spending my weekends on our platform, and the last thing I wanted was to spend 12+ hours on their frontend implementing features. With only one month left to go, stressed by the sprint features, deadlines and a ton of broken things to fix, I knew that I was gonna hate working on frontends eventually.

And then, it happened.

They hadn’t paid us for 5 months already (we kept reminding them, they kept postponing it), and when we reminded them about it (in 4 months), we got a message saying that they weren’t gonna pay us until we finish the platform. Firstly, they breached our contract by saying this, which meant we were free to pursue other clients and work on our own projects, and second, we knew that it was gonna take more time to finish their platform, and if we continued to spend time on them, our studio would run out of money.

Aftermath

As for me, that’s one more failed startup (in addition to what happened last year).

We stopped working for them (waiting on them to pay their debts) and resumed our work on the platform. We launched it for early access users and it’s registered as a separate company (which makes me… a cofounder!).

We’ve got two more clients now, for whom we’ll be building (non-Naamio) applications over the next year, and fund Naamio with that money. Let’s see how that goes…

Thanks to Phil for reviewing this post…

8 months…

2017-11-09T21:21:47+05:30

It’s been a while since I blogged. Last few months have been… well, complicated. Now that I’m travelling, I won’t get a better chance to tell the story.

As far as the world knows, I’ve been writing some Rust and Python code for some bioinformatics company (which is one of the few companies using Rust in production in South India). Job was good until the start of this year. As time flew, the folks who run the company made some fuckups - they introduced new rules, they kept postponing the product release, and eventually, I got tired of researching stuff for them.

So, I got back into open-source and started looking for a new job.

Job hunt

Job hunting was hard, especially with my experience. Back then, it had only been a year since I began writing production code. Almost everyone was looking for senior developers with 3-5 years of experience in PHP or Node or whatever.¹ The rest, who were looking for junior/mid-level devs, either didn’t respond or rejected me with an automated mail - which I understand, they get a lot of traffic, they can’t hire everyone!

I longed for a human response, even if it was a rejection. Because, I don’t simply go around and bulk apply for all the jobs that I could find! I take my time - only if I’m convinced that I might be a good fit for their open position, I tailor my email and send it to them.

Out of the 57 companies I’d applied to in 2 months, 8 rejected me with an automated mail (it’s funny to think that one company rejected me after 3 months), 6 rejected me because I wasn’t around their location (which they didn’t specify in the job listing), 3 rejected me after a technical round (I hate competitive coding, really!), and the rest never replied!

Meanwhile, I was working on Servo stuff. I wrote the parsing and serialization code for CSS grid (for Stylo). I rewrote a bot as a Github integration and added some cool features for it, and I was one of the coaches for two girls who participated in RGSoC 2017.

Into a YC startup

Around mid-May, @Manishearth linked me to a tweet which claimed that some startup called “Surematics” is looking to hire Rust devs. I applied, wrote some code, had a chat with the CTO (Phil) and CEO (when I also realized that the startup is part of YCombinator’s 2017 summer batch), booked my flights, and by June, I was in Mountain View on a 3-month contract (with a possible employment after demo day). It all happened quite fast!

My stack was totally new! Typescript for frontend, Rust for backend services, Kubernetes (in Azure) for orchestrating the microservices, and some cool new technologies like docker, vault, etcd and cockroach. The learning curve was huge!

What surprised me was the fact that we’re in YC’s batch without a product! I’ve been told that YC usually funds and guides startups which already have a product that’s being used by some clients. In our case, we didn’t even have a proper layout of the product!

Anyway, we’d planned to deliver something usable every week, but we couldn’t, because (in short) the goals/decisions kept changing, and we went on hacking stuff just to achieve those goals (which eventually kicked us in our asses). While I’d love to talk about this, it’s a big story, which Phil has summarized in a lovely series of posts.

In the end, the startup merged with another company, and we all had to hand over our work and go home. The irony is that, by that point, we had a nice working version of the product (because we’d been coding for 2 days straight!), but it didn’t matter.

In my last few days in CA, I was mostly enjoying myself. But, aside from that, I bought a new domain, designed my website, and launched it along with my projects on DigitalOcean. Now, I have a customized static server in Rust, my projects, Servo’s bots - all in Docker containers on CoreOS machines.

Post-YC life

My post-YC project was a bot library for the Wire messenger in Rust,² which actually took me into the async features of Rust for the first time. It was fun!

After some latency period, I began hunting for jobs again. Now, things were easy (probably because of the YC stunt). I got through some interviews and got some cool offers, but in the end, I rejected those because I got a better offer from Phil (the one who hired me for Surematics). He’d hooked me up with a Finnish startup, where I get to be a full-stack developer again! Only difference is that this time, I’ll be working on Swift along with Typescript and Rust.

My contract officially began this month, and I’m already enjoying it, because my work (right now) is mostly on FFI-bridging Swift and Rust (dark arts!). I’ll try to write up a technical post on this.

Anyway, last few months were interesting. Glad to be back now!

And yeah, I hate ‘em both! There, I said it. ↩
Well, they had a lot of libraries in Rust, but this particular SDK was in Node and Java, so I went ahead and took this project! ↩

Drawing an ASCII sketch

2017-03-01T00:46:53+05:30

Every once in a while, I get a (seemingly) nice and interesting idea (thanks to a wonderful female creature, who’s always been my muse), and whenever I get one, I go straight to researching more about it, allocating most of my free time, so that I finish it up ASAP and show it to her. Last time, it was a CSS spewing thingy. This time, it was about generating an ASCII sketch of a picture.

I’m sure you’ve seen all those “Image to ASCII converter”, “ASCII art generator”, and all sorts of boring variants of this online. But, I’ll tell you where they all fail and how I managed to come up with a decent sketch in ASCII. It took me about 3 hours to get a basic sketch, and then a few more to make it more generic and deploy it on my website.

Mapping ASCII values over RGB…

Let’s say I want to draw the ASCII sketch of this JPEG image…¹

I always tend to hack on stuff with Python. And, it has an amazing image library to play with images. With PIL, we can do something like this,

>>> from PIL import Image
>>> img = Image.open('sample.jpg')
>>> px = img.load()
>>> px[0, 0]
(72, 94, 91)

So, we now have all those 3-tuple RGB values in a 2D array. The first step would be to convert these RGB values to intensities² (i.e., grayscale, since the final ASCII art will look very similar to its grayscale version). It’s very easy, and PIL eases this a bit more,

>>> img = img.convert('L')
>>> px = img.load()
>>> px[0, 0]
87

Next stop is to have a bunch of characters sorted with respect to their pixel densities (like ' ' (space) for white, '.' (dot) for gray, '#' for black and so on). Once we have a character map, we simply map the characters over the grayscale image, like so…

>>> width, height = img.size
>>> for j in range(height):          # one line of text per row of pixels
...     print(''.join(CHARS[px[i, j] % len(CHARS)] for i in range(width)))

Looks very simple, right? All we need to do is find CHARS (the character map).
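To make the mapping concrete before hunting for a proper character map, here's a tiny standalone sketch. The ten-character CHARS below is hand-picked and purely hypothetical, the image is just a list of rows, and I'm scaling the intensity into the map (assuming dark text on a light background) rather than using the modulo from the snippet above.

```python
CHARS = " .:-=+*#%@"   # hypothetical map, lightest to densest


def to_ascii(pixels):
    # pixels: rows of grayscale values, 0 (black) to 255 (white).
    lines = []
    for row in pixels:
        # White maps to index 0 (space), black to the last (densest) character.
        lines.append("".join(CHARS[(255 - v) * len(CHARS) // 256] for v in row))
    return "\n".join(lines)


art = to_ascii([[255, 128, 0],
                [0, 128, 255]])
```

With this map, white becomes a space, mid-gray becomes '=' and black becomes '@'.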

There’s an ASCII art generator in Python, which has an interesting implementation for the character map. First, the printable ASCII characters are drawn in an image. Then, they’re sorted according to the pixel density of their render. This means, “space” has zero pixels, and so it will be the first thing you’ll find in the sorted list. Just what I wanted!

Let’s see how the ASCII image turns out after this mapping…

A gradient spew. Almost all the necessary details are gone! Bad luck, huh?

People work around this by clamping ranges of values to some character instead of using the entire ASCII table (like, all the colors ranging from light gray to white will be mapped to “space”). But, that’s still a workaround. It doesn’t help much.

Let’s see if we can tune this, by extracting necessary details from the image.

Getting the details…

An image (just like any other signal) can be represented as a sum of periodic (2D) waves of colors. It’s the magic of Fourier transform, that any signal can be represented as a sum of periodic waves of certain amplitudes and frequencies.

Let’s take our grayscale image and see what the first wave looks like,

>>> import numpy as np
>>> from numpy import fft as fourier
>>>
>>> ft = fourier.rfft2(img)         # get the 2D fourier transform of the grayscale image
>>> ft_new = np.zeros_like(ft)
>>> ft_new[0:1, 0:1] = ft[0:1, 0:1]
>>> rft = fourier.irfft2(ft_new)    # inverse transform
>>> img = Image.fromarray(rft)
>>> img.convert('L').save('fourier-1.jpg')

And, we get this - the initial component.

Now, let’s get the sum of first 3 waves…

>>> ft_new[0:3, 0:3] = ft[0:3, 0:3]     # we only need to change this line
>>> # everything else is the same

… and, we get the lowest frequencies from the image. This would be the base gradient.

… for 10 waves,

… and, for 50 waves,

Clearly, as we go further, the details are starting to show up. So, lower frequencies indicate gradients, and higher frequencies indicate finer things. In other words, lower frequencies tell you that there’s a face, whereas the higher frequencies show the finer details like edges, curvature, hairs, etc.
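The same effect shows up in one dimension, without numpy. This is a naive sketch of a discrete Fourier transform on a single row of pixels (quadratic time, so only good for toy inputs); dft, idft and keep_lowest are names I made up.

```python
import cmath


def dft(signal):
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def keep_lowest(spectrum, count):
    # Keep the `count` lowest frequencies (and their mirrored halves), zero the rest.
    n = len(spectrum)
    return [c if min(k, n - k) < count else 0 for k, c in enumerate(spectrum)]


row = [10, 10, 10, 200, 200, 10, 10, 10]      # a bright stripe on a dark row
spectrum = dft(row)
flat = idft(keep_lowest(spectrum, 1))         # only the first component
rough = idft(keep_lowest(spectrum, 2))        # a smooth bump, no sharp edges
exact = idft(keep_lowest(spectrum, len(row))) # everything: the original row
```

With only the first component, every pixel collapses to the average (57.5). The stripe's sharp edges only come back once the higher frequencies are let in.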

This means, we need the higher frequencies for the details. Well, we don’t have to fiddle around with the Fourier transform to achieve this, but it gives you an idea. Perhaps, the easiest way to get the details is by filtering out the lower frequencies.

First, we blur the image. Let’s apply a Gaussian blur filter (I usually pick a radius of 7 or 8).

>>> from PIL import Image, ImageFilter
>>> img = Image.open('sample.jpg')
>>> blur_filter = ImageFilter.GaussianBlur(radius=8)
>>> blur_img = img.filter(blur_filter)

Since it’s a low pass filter, we get the image with the higher frequencies stripped out.

Now, we invert the image, and blend it with the original image (with 50% opacity)…

>>> from PIL import ImageOps
>>> inv_img = ImageOps.invert(blur_img)
>>> blend = Image.blend(inv_img, img, 0.5)

This leaves us with the details…
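To see why this trick isolates the details, here's the same blur, invert and blend pipeline on a single row of pixels, in plain Python. I've swapped the Gaussian blur for a simple box blur to keep it short, and the function names are made up.

```python
def box_blur(row, radius):
    # Average each sample with its neighbours; indices are clamped at the edges.
    out = []
    for i in range(len(row)):
        window = [row[min(max(j, 0), len(row) - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def details(row, radius=2):
    blurred = box_blur(row, radius)
    inverted = [255 - v for v in blurred]
    # 50% blend of the inverted blur with the original:
    # (255 - blur + original) / 2 = 127.5 + (original - blur) / 2
    return [(inv + v) / 2 for inv, v in zip(inverted, row)]


row = [50] * 6 + [200] * 6       # a dark region followed by a bright one
result = details(row)
```

Wherever the row is flat, the blur equals the original and the result is exactly mid-gray (127.5). Only around the edge does it swing darker on one side and brighter on the other, so the output is mostly gray with the edges standing out.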

Now, we find a lot of gray areas. So, we have one last (and perhaps the most important) step, which is to adjust the levels. For this, we move the image from RGB space to HSV space, clamp the levels to a certain minimum, maximum and gamma values, and convert it back to RGB.

You can think of this as making darker areas black and lighter areas white altogether! It’s quite simple. Here’s an answer from Stackoverflow that provides a Python implementation of how the levels are clamped.
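For a single grayscale value, the clamping could look roughly like this. It's not the Stackoverflow answer's code, just a sketch of the usual levels formula, and the gamma convention (raising to 1/gamma) is an assumption on my part.

```python
def adjust_levels(value, low, high, gamma):
    # Clamp to [low, high], stretch that range to [0, 1], apply gamma,
    # and scale back to 0..255.
    clamped = min(max(value, low), high)
    normalized = (clamped - low) / (high - low)
    return round(255 * normalized ** (1 / gamma))


# Everything at or below `low` turns black, everything at or above `high`
# turns white, and only the narrow band in between keeps some gray.
samples = [adjust_levels(v, low=78, high=125, gamma=0.78) for v in (60, 78, 100, 125, 200)]
```

With a narrow window like this, most of the mid-gray mush is pushed to either pure black or pure white.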

As for our picture, once I clamp the levels (min = 78, max = 125, gamma = 0.78) and convert it to grayscale, I get this…

Looks like we’ve narrowed down more than enough details to get the ASCII art! Now, if we use the character mapping…

And, voila!

As a sidenote, you’re very welcome to play with my ASCII art generator, which does whatever I’d just shown you above.

See you next time!

Okay, I know what you’re thinking, but FYI, that’s definitely not the girl I talked about! ↩
We can however use the color data to get weighted colors and apply them over the final ASCII values, but meh… ↩

Exploring the human genome (Part 1)

2017-02-12T16:54:14+05:30

You may already know that my work in bioinformatics is mostly, well, research. All these months, I’ve been writing little tools in Rust (things that help speed up some boring analysis). These days, I’m involved in something new, something very interesting! Before we get into all that, I’ll try to give a general overview of the flow of data (without going way too much into biology), what kinds of data we deal with, how we analyze it, where I come in, etc., starting from this post.

Preamble - Reading the DNA…

I’m sure you already have an idea about the DNA - the double helix thing, a bunch of ATCGs, the genetic code, the basis of life, etc., but I’ll tell you a better story. To begin, we should go deeper, all the way down to the nucleus of a cell.

In there (for humans), we’ll find 23 pairs of wiggly thingies (chromosomes, if you want). That’s where we’ll also find the tightly packed and coiled DNA strand. Each chromosome has a specific set of nucleotides (ATCGs) and is labeled based on its size - one has ~250 million of them, and so it’s “Chr. 1”, another one has ~50 million of them, and so it’s “Chr. 22”.¹

A genome is a collection of all the genes. A gene is just a sequence of bases used to manufacture a protein (more on this next time). When we say “human genome”, we mean the whole thing, starting from the first base of the first chromosome to the last base of the last chromosome, which contains all the genes necessary for a human being.

Even though the DNA is very small, it’s rather long at its scale. Each nucleotide is around 3Å long and there are ~3 billion of them in total, which means if you stretch the DNA end-to-end, then it will span about 1 meter! So, every cell in your body has ~1 meter of DNA.² Your body has about 10 trillion cells and if they produce a copy of the DNA in every cycle, then your body alone produces a light year of DNA in your lifetime! More importantly, that’s almost the same copy of DNA that began in the embryo.

… which is a lot!

We know that the DNA is the basis of all the wonderful mechanisms going on inside, so the first step is to read the DNA. That’s what we call sequencing. The basic process goes something like this. A sample (say, from blood, or liver) goes through a complex preparation and gets loaded into the sequencing machine. Since there are only 4 bases and they bind only with their mate (A to T, or C to G, and vice versa), reading a single strand is enough (as we can always deduce that if one side was “ATTG…”, then the other side would be “TAAC…”).
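The pairing rule is simple enough to write down as a lookup table. A minimal sketch in Python (the names are mine):

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    # Swap every base for its mate: A <-> T and C <-> G.
    return "".join(COMPLEMENT[base] for base in strand)


other_side = complement("ATTG")   # "TAAC"
```

Applying it twice gives back the original strand, which is exactly why reading one side is enough.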

Firstly, the double helix is unwound and individual strands are separately sequenced. As we’re still at the molecular level, identifying the bases is tough (limited by the sensitivity of our instruments), and so, the sequence is amplified by making a huge number of copies. All you need is a suitable environment and a bunch of additional ATCGs to stick to the chain.

The interesting catch here is that the DNA sample won’t be a single straight strand during the process. It should be split into numerous sequences (ranging from 100 to a few thousand bases) and laid into multiple container-like thingies (lanes), where they’re individually amplified and collectively sequenced. Now, special ATCGs are used to stop the amplification, which light up (by fluorescing a color for each base) as they attach themselves to a particular nucleotide. Sensitive photoelectric devices are then used to take snapshots of these lights in the lanes.

Each color indicates the presence of the corresponding nucleotide, and voila! DNA sequenced! Here’s a wonderful TED lesson to visualize this.

The Reference Genome

Remember that we had to slice the DNA into fragments? Once the sample has been read, the sequencing machine generates a FASTQ file³, which looks something like this,

$ head -8 demo_blood.fastq
@ERR009127.307 IL22_2005:8:1:5:1941 length=36
GCAGACCCAGCGGGGCATGGGCGGACAGAGCCGCAC
+
<B?<@BB>>3BB>))<>43)94@=11=A?@=@B:/-
@ERR009127.308 IL22_2005:8:1:5:944 length=36
GGCGAACGCTTCGCTGGCCATTTAGGAGCTCTGCTC
+
B@AAAA=ABBB@B@BBB><;<A;<9<<;<;=<A;;3

Every 4 lines in this file is a “read”. People are usually interested in the second line (the sequence fragment) and the fourth line (the ASCII-encoded quality of individual bases) in each read. The quality score is much like the machine’s confidence on a particular base.

For example, the last base in the first sequence “C” has a value 45 (‘minus’ in ASCII) whereas in the second sequence, it’s 51 (number ‘3’ in ASCII), which means we’re relatively more confident about “C” in the second sequence. The scale is logarithmic and so, you can’t expect an ASCII value more than 100 to show up all the time in reality.
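Here's a small sketch that pulls the reads out of the snippet above and decodes the quality characters. I'm assuming the common Phred+33 encoding (score = ASCII code - 33, and error probability = 10^(-score/10)); the function names are made up.

```python
FASTQ = """\
@ERR009127.307 IL22_2005:8:1:5:1941 length=36
GCAGACCCAGCGGGGCATGGGCGGACAGAGCCGCAC
+
<B?<@BB>>3BB>))<>43)94@=11=A?@=@B:/-
@ERR009127.308 IL22_2005:8:1:5:944 length=36
GGCGAACGCTTCGCTGGCCATTTAGGAGCTCTGCTC
+
B@AAAA=ABBB@B@BBB><;<A;<9<<;<;=<A;;3
"""


def parse_fastq(lines):
    # Yield (header, sequence, quality) for every 4-line record.
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        yield header, sequence, quality


def phred_scores(quality, offset=33):
    # Assumes Phred+33: score = ASCII code - 33.
    return [ord(ch) - offset for ch in quality]


def error_probability(score):
    # A Phred score Q means the base is wrong with probability 10^(-Q/10).
    return 10 ** (-score / 10)


reads = list(parse_fastq(FASTQ.splitlines()))
last_scores = [phred_scores(quality)[-1] for _, _, quality in reads]   # [12, 18]
```

Under that assumption, the last "C" of the first read has a score of 12 (roughly a 6% chance of being wrong), and the one in the second read has 18 (under 2%).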

Even though we’ve done so much to get this FASTQ file, it won’t be of any use by itself, since it doesn’t tell us where a particular sequence was in the sample, which means we’ve no idea what gene we’re looking at.

So, we should reconstruct the DNA!

Assembly of DNA is a big deal, because you have to figure out where a sequence belongs. It’s like shredding a novel into bits of paper (which contain nothing more than a few words) and recreating it from the bits. This takes a long time! And, it’s error-prone. Years have passed since the first attempt, and yet, we don’t know some parts of our own genome.

Building a DNA from the FASTQ file every time we read a sample would be rather silly. So, people have spent years to land on a basic template DNA. This is called the reference genome. It’s incomplete, but it’s accurate enough. Every species has its own reference genome. The human one is pretty big - a horrible 3GB file! It serves as a template because all humans share more than 99% of the genome. In other words, we differ only by a few million bases.

This is what the reference genome looks like…

$ head -5 ~/data/BWAIndex/hs37d5.fa
>chr1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

It begins with “chr1” indicating the first chromosome (and its range - 1 to 249,250,621). Note that it begins with a lot of Ns. “N” indicates that we’ve no idea what base occupies that position (like I said, parts are still incomplete). Let’s seek to a span with some data…

$ sed -n '200,204p' ~/data/BWAIndex/hs37d5.fa
TCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGA
CTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTT
CATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCAC
TGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCAT
CTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATT

Throughout the file, this is what you find most of the time. So, it’s accurate enough.

Stitching the DNA…

Our quest is always to answer questions like, “Where have the changes happened?”, “Why did they happen there?”, “Does/Doesn’t it cause any disease/disorder?”, etc. Now that we have a digital version of the reference genome, the next checkpoint is to align the reads from our FASTQ file to the reference.

Before that, we also do a quality check on the FASTQ file (using a Rust-powered tool) to ensure that it’s good. For instance, you can’t have a lot of Ns in a particular sequence (we obviously don’t want a lot of unknown bases). Also, if we assume that ATCGs are randomly distributed (which is often the case), then you can expect each base to occur 25% of the time on average (if A is less than 5% and T is more than 40%, then we clearly don’t want that). Then, we look for patterns of DNA which are junk (apparently, we won’t get any useful info out of those). So, after these (and a lot more similar) quality checks, we trim the sequences, we filter some sequences, or even throw away the FASTQ file as needed.
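Two of those checks (too many Ns, lopsided base composition) fit in a few lines. This is only a sketch and not the Rust tool: the function names are made up, the N threshold is invented, and I'm applying the 5% / 40% bounds to a single sequence just to keep it short.

```python
from collections import Counter


def base_fractions(sequence):
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) / len(sequence) for base in "ACGTN"}


def passes_qc(sequence, max_n=0.1, min_base=0.05, max_base=0.40):
    # Thresholds are illustrative: max_n is made up, the 5% / 40% bounds
    # echo the example in the text.
    fractions = base_fractions(sequence)
    if fractions["N"] > max_n:
        return False
    return all(min_base <= fractions[b] <= max_base for b in "ACGT")


balanced = passes_qc("ACGT" * 9)                  # every base at 25%
too_many_ns = passes_qc("N" * 10 + "ACGT" * 5)    # a third of it is unknown
lopsided = passes_qc("A" * 36)                    # nothing but A
```

Only the first one passes; the other two get filtered out.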

If the FASTQ file is good, then we proceed for alignment. This is where things get a bit more interesting, because alignment is very much a string searching problem. If we’re able to afford ~20 GB of RAM (which we can, given the low cost of cloud services like AWS or Google Cloud), then suffix arrays could be used for the exact string matching problem. We build the suffix array for the reference genome (takes around 20-30 mins) and binary search for the occurrence of the FASTQ sequence. Once we find the position, we can find the chromosome it belongs to (if it’s less than ~250 million, then it’s definitely the first chromosome and so on).
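For a toy string, the suffix array approach looks like this. The construction below is the naive one (sort all the suffixes), which is fine for a demo and hopeless for 3 billion bases; the function names are mine.

```python
def build_suffix_array(text):
    # Naive construction: sort the starting positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, suffix_array, pattern):
    m = len(pattern)
    # Binary search for the first suffix whose m-character prefix is >= pattern ...
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # ... and then for the first suffix whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    # Everything in between starts with the pattern.
    return sorted(suffix_array[start:lo])


reference = "AACCGAACC"
suffix_array = build_suffix_array(reference)
hits = find_all(reference, suffix_array, "ACC")   # [1, 6]
```

Each lookup costs a handful of comparisons (logarithmic in the size of the reference), but the array itself is what eats all that RAM.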

But, that wouldn’t be enough. All the sequencing machines make mistakes. What if we encounter a “N” along the way? What if the machine wasn’t very confident about the base it just read? What if a mutation has happened at a particular position? What if a virus infection deleted a bunch of nucleotides consecutively from the sample genome?

Statistically, this happens 20-30% of the time in a FASTQ file. Clearly, it’s an approximate string matching problem. So, how do we solve it?

Almost all the sequence aligners out there use the magical Burrows-Wheeler transform. It’s much like a wrapper over the suffix array (as a matter of fact, for longer strings, you need the suffix array to build the BWT - the usual sorting could take days!). Once the BWT is in place, we use the FM-index to find things in time that depends only on the length of the query, not on the size of the genome. For the human genome (without any optimization), you may need ~25 GB of RAM to do this, but it works pretty well, and it’s worth every penny!

Moreover, since the reference genome doesn’t change (until we get the next version), its BWT will be the same. It won’t take more than 5-6 mins to build the FM-index from the BWT.

$ wc -c ref_genome gram_bwt
3137454505 ref_genome
3137454506 gram_bwt     # BWT needs one additional byte
6274909011 total

I’d written a Rust library for this. Let’s take a look at its powers! I’ll try the first sequence from the FASTQ file I just showed you above.

$ ./gram -c GCAGACCCAGCGGGGCATGGGCGGACAGAGCCGCAC
[["chr16",69776079,"NOB1"]]

And, we have a wonderful exact match! The sequence matches position 69776079 in the 16th chromosome, which is within the range of the NOB1 gene (don’t worry about genes for now, we’ll get into it in future posts). You can check this with Wolfram Alpha (you know, just in case you don’t believe me). The time taken for the Rust backend⁴ is ~50μs.

This means, we can align ~5000 sequences at a time (if they match perfectly) using a single CPU, and that’s lightning fast!

Magic behind the FM-index

Now, we come to the question of how it’s so magical, and how we go about approximately matching a sequence. In order to generate the BWT, we append a null byte to the string, get all its rotations and sort them. For a string, say, “AACCG”, we append the null byte (written as $ here) to its end, and get the sorted rotations like this,

$ A A C C G
A A C C G $
A C C G $ A
C C G $ A A
C G $ A A C
G $ A A C C

Here, the last column G$AACC is the BWT of AACCG$. The suffix array of the same string is [5,0,1,2,3,4]. If we map the suffixes to their indices in the suffix array, we get this,

$
A A C C G $
A C C G $
C C G $
C G $
G $

And, we find a strong resemblance. They share the first column. The FM-index has the BWT and some additional information regarding the first column and the BWT itself. Using this, we can narrow down our search space. If your query starts with “C”, then you’ll know that “C” suffixes lie in the half-open range [3, 5) of the suffix array (rows 3 and 4, counting from 0). For the next base, you feed the previous range along with the base, and the index will return the next range.⁵ If it’s a valid range, then the sequence exists, or if it’s invalid, then we’re unlucky.

I won’t go into the details here (I’m excited to talk about it, but this post is already big!).
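Still, a toy version fits in a few lines. The sketch below is not my library’s code: `FmIndex`, `extend` and `count` are illustrative names, the suffix array is built by naive sorting, and the occurrence table is stored in full (a real index samples it to save memory). It assumes the text only contains the characters A, C, G, N and T.

```rust
/// Sorted alphabet; '$' is the sentinel appended to the text.
const ALPHABET: &[u8] = b"$ACGNT";

fn code(b: u8) -> Option<usize> {
    ALPHABET.iter().position(|&x| x == b)
}

pub struct FmIndex {
    bwt: Vec<u8>,
    /// less[c] = number of characters in the text that sort before c
    less: Vec<usize>,
    /// occ[c][i] = number of times c occurs in bwt[..i]
    occ: Vec<Vec<usize>>,
}

impl FmIndex {
    pub fn new(text: &str) -> FmIndex {
        let mut t = text.as_bytes().to_vec();
        t.push(b'$');
        let n = t.len();
        // Naive suffix array, good enough for a demo.
        let mut sa: Vec<usize> = (0..n).collect();
        sa.sort_by(|&a, &b| t[a..].cmp(&t[b..]));
        // The BWT is the character just before each sorted suffix.
        let bwt: Vec<u8> = sa.iter().map(|&i| t[(i + n - 1) % n]).collect();
        let k = ALPHABET.len();
        let mut occ = vec![vec![0usize; n + 1]; k];
        for (i, &b) in bwt.iter().enumerate() {
            for c in 0..k {
                occ[c][i + 1] = occ[c][i];
            }
            if let Some(c) = code(b) {
                occ[c][i + 1] += 1;
            }
        }
        let mut less = vec![0usize; k];
        for c in 1..k {
            less[c] = less[c - 1] + occ[c - 1][n];
        }
        FmIndex { bwt, less, occ }
    }

    pub fn bwt(&self) -> &[u8] {
        &self.bwt
    }

    /// Feeds one more base (to the left of what has been matched so far) along
    /// with the previous half-open range, and returns the next range if valid.
    pub fn extend(&self, lo: usize, hi: usize, base: u8) -> Option<(usize, usize)> {
        let c = code(base)?;
        let next = (self.less[c] + self.occ[c][lo], self.less[c] + self.occ[c][hi]);
        if next.0 < next.1 { Some(next) } else { None }
    }

    /// Number of exact occurrences of `query`, searching from its last base.
    pub fn count(&self, query: &str) -> usize {
        let mut range = (0, self.bwt.len());
        for &b in query.as_bytes().iter().rev() {
            match self.extend(range.0, range.1, b) {
                Some(r) => range = r,
                None => return 0,
            }
        }
        range.1 - range.0
    }
}
```

For `FmIndex::new("AACCG")`, the stored BWT is `G$AACC`, and feeding “C” with the full range (0, 6) gives back (3, 5), i.e. rows 3 and 4 of the sorted suffixes. The width of the final range is the number of occurrences.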

Let’s try a sequence…

$ head -4 SRR413130.fastq
@SRR413130.1 A2097DABXX:4:1:2222:2104/1
GGACNGAGTTATCGAGGCACATACTCCACCACTGTCACAGGAAGAACCT
+
AA?@#BCCCCEGGGGGGEGGGGFGGGGGGGGGGGGGGGFGGGGGFGGGG

$ ./gram -c GGACNGAGTTATCGAGGCACATACTCCACCACTGTCACAGGAAGAACCT
[]

We don’t have a match, because we have an “N” there (most often, that’s the case). Let’s try querying the sequence after the “N”…

$ ./gram -c GAGTTATCGAGGCACATACTCCACCACTGTCACAGGAAGAACCT
[["chr6",-161038213,"LPA"],
 ["chr6",-161043759,"LPA"],
 ["chr6",-161049303,"LPA"],
 ["chr6",-161060396,"LPA"],
 ["chr6",-161065943,"LPA"]]

Whoa! We have a lot of matches now. They all belong to the 6th chromosome and the “LPA” gene, but the positions are negative. Why? That means we have matched its reverse complement AGGTTCTTCCTGTGACAGTGGTGGAGTATGTGCCTCGATAACTC. We’ve reversed the query, complemented the bases (A for T, C for G, etc.), and then we get a match. Let’s try it.

$ ./gram -c AGGTTCTTCCTGTGACAGTGGTGGAGTATGTGCCTCGATAACTC
[["chr6",161038213,"LPA"],
 ["chr6",161043759,"LPA"],
 ["chr6",161049303,"LPA"],
 ["chr6",161060396,"LPA"],
 ["chr6",161065943,"LPA"]]

There we go! The same matches (in forward direction). What this really means is that we’ve matched the other side of the double helix (which is fine too, since both are from the same DNA). Now, let’s try the whole sequence, changing the “N” to “A”…

$ ./gram -c GGACAGAGTTATCGAGGCACATACTCCACCACTGTCACAGGAAGAACCT
[["chr6",-161038213,"LPA"],
 ["chr6",-161043759,"LPA"],
 ["chr6",-161049303,"LPA"],
 ["chr6",-161060396,"LPA"],
 ["chr6",-161065943,"LPA"]]

And, we have the matches!
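The reverse complement used in the queries above is simple enough to sketch. The function name is mine; unknown bases such as “N” are passed through unchanged, which is an assumption about how you’d want to treat them.

```rust
/// Reverses a sequence and swaps each base with its partner on the other
/// strand (A <-> T, C <-> G). Anything else (e.g. N) is left as it is.
pub fn reverse_complement(seq: &str) -> String {
    seq.bytes()
        .rev()
        .map(|b| match b {
            b'A' => 'T',
            b'T' => 'A',
            b'C' => 'G',
            b'G' => 'C',
            other => other as char,
        })
        .collect()
}
```

Feeding it GAGTTATCGAGGCACATACTCCACCACTGTCACAGGAAGAACCT gives AGGTTCTTCCTGTGACAGTGGTGGAGTATGTGCCTCGATAACTC, the same pair of sequences we queried above.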

Let’s have a look at the ranges while querying FM-index…

$ ./random unmatched_SRR413130.fastq
@SRR413130.12176383 A2097DABXX:4:62:1132:102596/1
AGAGCGGAGGCAGGAGTTGGGCCCCAATTTGCTTCACGTNAAATTTATG
+
DDDDDDDCDDD:CD=;2<<?CBDDCBBBBBBBBBB;>:7#08665BBBB

$ ./gram -c AGAGCGGAGGCAGGAGTTGGGCCCCAATTTGCTTCACGTNAAATTTATG
[]

So, we have another sequence which doesn’t match. I’ve modified the wrapper to show us the range output for each base. Let’s feed the bases one by one…

$ ./gram -c e:::G
(G, 1449387873, 2042847480)

$ ./gram -c e:1449387873:2042847480:T
(T, 2642364268, 2853319690)

We begin from the last base⁶ “G”, and we get a range. Then, we feed the second last base “T” with G’s range, and we get a new range. It’s valid. Now, let’s try feeding the whole thing.

$ ./gram -c AGAGCGGAGGCAGGAGTTGGGCCCCAATTTGCTTCACGTNAAATTTATG --debug
(G, 1449387873, 2042847480)
(GT, 2642364268, 2853319690)
(GTA, 730337789, 783909233)
(GTAT, 2436460939, 2448269684)
(GTATT, 2902060112, 2905545457)
(GTATTT, 3044849837, 3046261437)
(GTATTTA, 832453834, 832844405)
(GTATTTAA, 276956227, 277060107)
(GTATTTAAA, 109010853, 109046976)
(GTATTTAAAN, 2042847950, 2042847950)
[]

Apparently, we’re losing wonderful matches just because of a few accidental/incidental substitutions, insertions or deletions (and like I said, this happens 20-30% of the time). In our case, it has stopped at “N”, because GTATTTAAAN has returned an invalid range. This is where fuzzy string matching comes in. Whenever we encounter an invalid range, we go back to the previous range, try a different base in place of the offending one, and backtrack from there if that fails too. Since there are only 4 possible bases (and since the FM-index is fast), depending on our algorithm, we won’t be risking a lot of computing time unnecessarily.

As for this query, changing the “N” to “C” will return a match. This is a mismatch. Insertions and deletions can happen too, and it’s up to the aligner to decide which alignment to take/leave.

However, how deep we let the backtracking go depends on our resources, i.e., how much edit distance we’re willing to allow (the more we allow, the more bad sequences we let in and the more computing time we spend, whereas disallowing edits entirely means leaving out the most important sequences).
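To give a feel for the backtracking, here’s a self-contained sketch that only handles substitutions (no insertions or deletions) with a fixed mismatch budget. It is not what my aligner does: for simplicity it tries alternative bases at every position while the budget lasts, rather than only where the exact search fails, and the names `Index`, `step` and `fuzzy` are made up. The reference is assumed to contain only A, C, G and T.

```rust
const BASES: &[u8] = b"ACGT";

fn code(b: u8) -> Option<usize> {
    BASES.iter().position(|&x| x == b)
}

/// A tiny FM-index over a reference that contains only A, C, G and T.
pub struct Index {
    less: [usize; 4],     // characters in the text that sort before each base
    occ: Vec<[usize; 4]>, // occ[i][c] = occurrences of base c in bwt[..i]
    n: usize,             // text length, including the '$' sentinel
}

impl Index {
    pub fn new(text: &str) -> Index {
        let mut t = text.as_bytes().to_vec();
        t.push(b'$');
        let n = t.len();
        let mut sa: Vec<usize> = (0..n).collect();
        sa.sort_by(|&a, &b| t[a..].cmp(&t[b..])); // naive, demo only
        let mut occ = vec![[0usize; 4]; n + 1];
        for (i, &s) in sa.iter().enumerate() {
            occ[i + 1] = occ[i];
            // BWT character for this row is the one preceding the suffix.
            if let Some(c) = code(t[(s + n - 1) % n]) {
                occ[i + 1][c] += 1;
            }
        }
        let mut less = [1usize; 4]; // the sentinel sorts before every base
        for c in 1..4 {
            less[c] = less[c - 1] + occ[n][c - 1];
        }
        Index { less, occ, n }
    }

    /// Narrows a half-open range by one more base; `None` if it becomes empty.
    fn step(&self, (lo, hi): (usize, usize), c: usize) -> Option<(usize, usize)> {
        let r = (self.less[c] + self.occ[lo][c], self.less[c] + self.occ[hi][c]);
        if r.0 < r.1 { Some(r) } else { None }
    }
}

/// Every sequence within `max_mismatches` substitutions of `query` that occurs
/// in the reference, along with its number of occurrences.
pub fn fuzzy(idx: &Index, query: &str, max_mismatches: usize) -> Vec<(String, usize)> {
    let q = query.as_bytes();
    let mut out = Vec::new();
    let mut chosen = Vec::with_capacity(q.len());
    search(idx, q, q.len(), (0, idx.n), max_mismatches, &mut chosen, &mut out);
    out
}

fn search(
    idx: &Index,
    q: &[u8],
    remaining: usize,
    range: (usize, usize),
    budget: usize,
    chosen: &mut Vec<u8>,
    out: &mut Vec<(String, usize)>,
) {
    if remaining == 0 {
        // Bases were consumed from the end of the query, so reverse them.
        let matched: String = chosen.iter().rev().map(|&b| b as char).collect();
        out.push((matched, range.1 - range.0));
        return;
    }
    let want = q[remaining - 1];
    for (c, &base) in BASES.iter().enumerate() {
        let cost = if base == want { 0 } else { 1 };
        if cost > budget {
            continue; // out of edits: only the exact base is allowed
        }
        if let Some(next) = idx.step(range, c) {
            chosen.push(base);
            search(idx, q, remaining - 1, next, budget - cost, chosen, out);
            chosen.pop();
        }
    }
}
```

With the reference `GATTACA`, querying `GANTACA` with a budget of 1 recovers `GATTACA` (the “N” gets replaced by “T”), whereas a budget of 0 finds nothing. An “N” in the query always costs one edit here, since it never equals a real base.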

Now that we’ve aligned the FASTQ file to the reference genome, the next checkpoint is to infer the alignments - like how many gaps/insertions have occurred, whether a particular substitution is a mutation or whether it’s a machine error, or whether the alignment itself is wrong, etc., but that’s for the next post.

Auf Wiedersehen…

Then, there’s the X and Y chromosomes which determine sex. ↩
Actually, it’s every cell with a nucleus. That’s because not all cells have a nucleus - even though the red blood cells, the cells in hair, skin, nail, etc. start with a nucleus (with a DNA), they destroy their nucleus as they mature. ↩
I’m choosing the simpler case here, because nowadays, a machine generates two FASTQ files - one belonging to each strand of the DNA (one will be in the forward direction, and the other will be in the reverse direction, as if their tails are tied up). When we analyze the files, we get reads from the paired files at the same time. ↩
I have a simple TCP server that listens to a particular port for sequence requests (or job requests for running an entire file!). That’s just a client making a request and printing the response. Rust’s concurrency primitives are rather charming to work with, and it’s always pushing me to write parallel code. ↩
The range is very useful by itself for counting the occurrences of substrings. The difference between them indicates the number of occurrences. ↩
That’s because I have a forward BWT, and it’s got to do with how the FM-index works. If I had a BWT of the reverse genome, then I’ll be querying from the start. ↩

Three months…

2017-02-11T23:24:29+05:30

It’s been 3 months since I’ve blogged. A lot can happen in 3 months. Someone who was once close to you could leave you, your dad could lose his job (shaking your financial support), you could get to the point where you no longer want your bachelor’s degree, etc.

In the midst of this, wonderful things can happen too - you could get invited to the awesome Mozilla “All Hands!”, where you get to see all kinds of interesting people from all over the world (including the ones you’ve been chatting with on IRC), you get to hang out with them, you get to live with them for a week, FUN!!! ¹

And, once you get back, you get the promise of a superior distraction - a nice little project that could keep you distracted for months!

You see? Balance!

Anyway, I’ve planned to write a bunch of posts about my work in bioinformatics, especially the project I’m involved in right now. I’ll make sure that they’ll be interesting by having a mixture of bio-stuff and coding. See you then…

Well, you could also get allergic to seafood (lobster, in my case) and get “hives”, but meh… ↩

An easy bug in Stylo…

2016-10-05T08:30:34+05:30

While my new job demands writing backend tools in Rust, I get a lot of free time every once in a while, when I fiddle around with Servo’s code. Lately, I got interested in Stylo.

Stylo is interesting enough for it to need a whole writeup about itself, but this post is just about an easy stylo bug, which then turned slightly ugly. Well, it’s no big deal, since developers usually deal with this kind of thing every day, but since it’s an easy bug, I thought it might give some ideas to the newcomers (to stylo) about where to look when hacking on stylo, and to keep pushing and not give up if an easy issue becomes less easy…

“Stylo” in a nutshell!

There’s parallel style code in Servo and sequential C++ code in Gecko. In stylo, we isolate Servo’s style libraries and hook it up to Gecko (with a sleek FFI) and make it use that instead. Now, that’s easier said than done, but once we have this integration, we can focus on pure style stuff, without having to worry about unimplemented layout/rendering stuff in Servo (since a Firefox build will provide feedback on how things are going).

Stylo contains both Gecko and Servo code. The workflow is somewhat difficult for a newcomer, because sometimes it demands submitting patches to both Gecko (hg repo) and Servo (git repo), dealing with codegen (Mako for the glue code, and a version of rust-bindgen for translating numerous C++ stuff to Rust), and finally testing them (when you build stylo and manually check whether your changes work).

How it began…

A “good first bug” in stylo usually involves changing (or adding) something in the glue code. Manish had scraped a few pages and put up a list of CSS properties, which comes in rather handy. As we can see, there’s some stuff that’s implemented in Servo, but not in Stylo. For those properties, the changes reside in the glue code (mostly)¹, where we only have to get the computed values from Servo and set them on Gecko.

I’d done border-spacing a few days back. It was pretty easy, as it required nothing more than copying some values from Servo to Gecko. The next thing in my queue was font-stretch, which “looked” pretty similar.

The core principle behind style code is that each property belongs to a particular type! It’s always a struct field (or a bunch of fields) in Gecko, whereas it’d be an enum or a struct in Servo.

font-stretch turned out to be an enum in Servo, whereas it’s a 16-bit signed integer in Gecko. And, it wasn’t straight-forward (like I thought it’d be).² Whenever we encounter an enum, we can easily cast it away to an integer primitive. But, we can’t do that here, because some of the constants were negative. In order to keep things future-proof, we need the constants in Servo before we can do anything.
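To see why a plain cast doesn’t work, here’s a small standalone sketch. The enum is a stand-in for Servo’s computed value, and the `NS_FONT_STRETCH_*` constants are stand-ins for what bindgen would generate from Gecko’s headers; both the names and the values (-4 to 4) are written from memory and should be treated as assumptions, not as the real generated code.

```rust
/// Stand-in for Servo's font-stretch keyword enum (declaration order is an assumption).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FontStretch {
    UltraCondensed,
    ExtraCondensed,
    Condensed,
    SemiCondensed,
    Normal,
    SemiExpanded,
    Expanded,
    ExtraExpanded,
    UltraExpanded,
}

// Stand-ins for the Gecko constants (assumed values).
pub const NS_FONT_STRETCH_ULTRA_CONDENSED: i16 = -4;
pub const NS_FONT_STRETCH_EXTRA_CONDENSED: i16 = -3;
pub const NS_FONT_STRETCH_CONDENSED: i16 = -2;
pub const NS_FONT_STRETCH_SEMI_CONDENSED: i16 = -1;
pub const NS_FONT_STRETCH_NORMAL: i16 = 0;
pub const NS_FONT_STRETCH_SEMI_EXPANDED: i16 = 1;
pub const NS_FONT_STRETCH_EXPANDED: i16 = 2;
pub const NS_FONT_STRETCH_EXTRA_EXPANDED: i16 = 3;
pub const NS_FONT_STRETCH_ULTRA_EXPANDED: i16 = 4;

/// Maps each keyword to the corresponding constant explicitly, instead of
/// relying on the enum's discriminant.
pub fn to_gecko(value: FontStretch) -> i16 {
    match value {
        FontStretch::UltraCondensed => NS_FONT_STRETCH_ULTRA_CONDENSED,
        FontStretch::ExtraCondensed => NS_FONT_STRETCH_EXTRA_CONDENSED,
        FontStretch::Condensed => NS_FONT_STRETCH_CONDENSED,
        FontStretch::SemiCondensed => NS_FONT_STRETCH_SEMI_CONDENSED,
        FontStretch::Normal => NS_FONT_STRETCH_NORMAL,
        FontStretch::SemiExpanded => NS_FONT_STRETCH_SEMI_EXPANDED,
        FontStretch::Expanded => NS_FONT_STRETCH_EXPANDED,
        FontStretch::ExtraExpanded => NS_FONT_STRETCH_EXTRA_EXPANDED,
        FontStretch::UltraExpanded => NS_FONT_STRETCH_ULTRA_EXPANDED,
    }
}
```

Here, `FontStretch::UltraCondensed as i16` is 0 (discriminants start from zero), while the Gecko side expects -4, so the cast silently produces the wrong value. Matching against the constants keeps working even if the Gecko values change later.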

While we already have most of the types and values, importing a few more is pretty easy. Emilio had done such a great job with bindgen that we now have a bunch of tools for generating the necessary bindings required for the glue code. So, simply including the file and adding the constants’ pattern should do it.

… or so I’d thought.

There was a slight trouble with the bindgen along the way, but once it got fixed, I could generate the bindings in no time. Surprisingly enough, bindgen then seemed to ignore constants with negative values. So, it was time to get into bindgen code.

A bug within a bug…

Parsers never cease to impress me.³ As complicated as they look, they can never be perfect, and always have bugs! Rust bindgen is something that translates C++ code units to Rust (with support from clang libraries). It also has a parser. So, with the current scenario on hand, it’s natural to assume that we’re ignoring the negative values whenever we parse #define directives in C++ code.

Initial digging showed that this is where we filter the collected Rust constants (translated from #define) with respect to the whitelist of patterns, but throwing some println! there showed that this set of constants never even gets there in the first place!

After more digging, it turned out that we skip parsing if we don’t find an integer literal in a code unit. In our case, since parentheses and unary minus don’t count as literals, they’d been neglected by the parser.
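Just to illustrate the kind of input that tripped things up, here’s a tiny standalone parser that accepts an integer wrapped in parentheses with an optional unary minus. This is emphatically not bindgen’s code (bindgen works on clang’s tokens, not on raw lines), and `parse_int_define` is a name I made up for this sketch.

```rust
/// Parses lines like `#define NAME 3` or `#define NAME (-4)` into (name, value).
/// Returns `None` for anything that isn't a plain decimal integer.
pub fn parse_int_define(line: &str) -> Option<(String, i64)> {
    let rest = line.trim().strip_prefix("#define")?;
    let mut parts = rest.trim_start().splitn(2, char::is_whitespace);
    let name = parts.next()?.to_string();
    let mut value = parts.next()?.trim();

    // Peel off any number of matching outer parentheses.
    while let Some(inner) = value.strip_prefix('(').and_then(|v| v.strip_suffix(')')) {
        value = inner.trim();
    }

    // A unary minus is not part of the literal itself, so handle it separately.
    let (negative, digits) = match value.strip_prefix('-') {
        Some(d) => (true, d.trim_start()),
        None => (false, value),
    };
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let magnitude: i64 = digits.parse().ok()?;
    Some((name, if negative { -magnitude } else { magnitude }))
}
```

A parser that bails out unless the value is a bare integer literal would accept `#define FOO 0` but skip `#define FOO (-4)`, which is the sort of behaviour described above.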

Finally, a patch to bindgen followed by another patch for bindings regeneration was enough to make my actual glue code patch work.

This is what I like about Stylo. It’s hard, interesting, new and definitely not straight-forward.⁴ So, I’m planning to keep fiddling around with it for a while…

Mostly… but not necessarily. It could also mean that we can’t write the glue for that particular property easily, because either the Gecko code is complicated, or transforming the values from Servo to Gecko has some complication. ↩
There are tons of unimplemented easy properties in stylo (like this one). Some are pretty straight, while others require a bit of hacking, and spending time with both the codebases. ↩
They attract me so much that whenever I get to work with them, I tend to spend more time admiring the existing code rather than concentrating on the particular problem on hand. ↩
I admit. This consumed a few hours of mine. If I’d asked around, then maybe I’d have fixed this within an hour or so. But, where’s the fun in that? ↩

New job! New field of science!

2016-07-12T20:36:59+05:30

It’s been about a year since I’ve blogged. A lot of stuff has happened in the meantime. I became a reviewer for the Servo browser engine - especially the python code (which felt good), attended a flight training program at IIT, Kanpur (which was pretty fun), had a war with some of the professors (which has postponed my bachelor’s degree, meh), and now I’m working for a bioinformatics company, writing production code in Rust (which is cool!).

While I was doing my final year project, I applied for an internship at a bioinformatics company. For the first week or so, it was just python and shell scripts (boring stuff, really), until one day, I ported some of the python code to Rust and gave a demo on that. That was it! From then on, until the end of the internship, and now, my job, is totally on Rust!

All these days, they’ve been using third-party tools for their analysis, connecting them with shell pipes, tinkering the output with a few scripts, and finally bringing it to the front-end. Now, there’s an opportunity (for them) to break their painful dependencies, research on things, write stuff from scratch, while I can get deeper into systems programming, and simultaneously get an actual experience in writing production code (in Rust!). So, it’s a “win-win” situation!

As an example, there’s this Java-powered tool called FastQC which analyzes FASTQ data. With the help of some bioinformatics fellas, it’s pretty easy to reconstruct what the tool does, from its output. By the end of the internship, I was asked to rewrite the tool (ASAP!). They helped me with the spec, and it took me exactly 12 days (~70 hours) to write the Rust version of that tool. ¹

The initial version doesn’t have any kind of unsafe code, and I didn’t optimize it very well. It only utilizes the APIs in the standard library for efficient reading, data storage, and parallelization, but it was already ~20% faster than its (carefully crafted) rival, and there’s still plenty of room for improvement. Now that we’ve got our own tool, we can have as many features as we want! ²

Anyway, I’m pretty sure we’re the only ones using Rust for production in our state (right now), and I’m quite happy about this, because my job allows me to play with the language I love, and I’ve got more than enough learning space here. So, maybe I’ll stay around for a while, and see how far this is gonna take me…

Even though there are languages specifically designed for scientific computing (like Julia, for example), I personally believe Rust has a great future in big data analysis!

I’m planning to blog about it soon. ↩
They were actually impressed by this, and it got me the “game changer” award :P ↩

100 shades of green: The journey of a coder…

2015-09-07T18:33:18+05:30

I’ve been coding for about a year now. I’ve danced with Python (a lot!) and nowadays, I’m playing with Rust, although I’ve also done some basic C & Javascript. Anyways, I get a lot of questions from my fellow undergrads about how I got into coding in the first place, and yesterday, Manish gave me the idea to blog about it.

Also, since my commit streak has reached 100 days (with 1k commits), I think now’s a great time to share my story with y’all…

On a side note, this post is intended for those who’re about to get involved in the art of coding, though I assure the rest of you that it will be interesting for others as well.¹

Good ol’ days with the computer…

When I got my first computer, all I ever wanted to do was play games (well, I still do, but I love other things too). Apart from that, I liked to paint (you know, lines, squares, circles, colors, wheee…). Though I did encounter C in my school days, I didn’t know the purpose of it, because I simply didn’t realize the powers of coding. The only thing that attracted me was HTML (4), because it can do some pretty stuff - that way, I realized that with some text I can make colorful things on a computer!

I discovered many things during my high school days, but the only thing that mattered the most was that characters in computers can do more than just coloring - that was the time when I played with matrices and generated prime numbers in C (basic stuff you learn in high-school). That was also the time when I had developed this weird desire to fiddle around the things which are hidden from plain sight. I just don’t like it that way.

For instance, take the beautiful “Microsoft Windows XP” (the OS of my first computer) - the C:\ drive is forbidden by default, and it has a lot of files which are unknown to me. I had a friend who shared my trait of fiddling around with unknown files. Since his father owned an internet cafe, whenever we’re lost, or screw up something (or get screwed up by a virus, mostly a worm), we get a solution almost immediately. Later, we take those problems to our school, so that we can screw up the computers in our labs (let’s just say that we hated some of our teachers and we wanted to make their lives miserable!).

That was the time when I also got interested in the command prompt (which I terribly hate nowadays, since I got into bash). For example, we used this tiny snippet for recursively walking inside a path and hiding the contents inside. It was used often by us to drive the teachers and students crazy (for a month!).²

attrib [path] +h +r +s /s /d

As a newbie to the internet…

It wasn’t until college when I got my first laptop along with a pathetic excuse for an internet connection (which I still have). By the end of my fresher year, a great deal of things had happened.

I got into Stack Exchange, which had some great impacts on my personality over the years (that’s also where I met Manish). At the start, I had the motivation to explore the system, interact with a lot of users, and learn to contribute, but once that became handy, I got bored. Yeah, reading stuff on the internet, participating in a great community is awesome and all, but I needed something more - something that could keep me equipped when I’m bored of reading.

I got excited about math and wanted to solve problems - it was right about that time when I discovered Project Euler. It demanded the users to code to solve the problems. I’m sure every newbie coder gets interested in those problems, because it demands only the solution for a particular problem, which means you ought to guess an algorithm first (sometimes, it’d be the awful bruteforce), try to implement it (when you’ll discover more about the language of choice - with the help of Stackoverflow, of course), get the result and then you can refine the algorithm (unlike SPOJ or Code Chef, which are designed to concentrate more on algorithms and data structures). In those days, I used C.

My sophomore year began with blogging - when I was mostly writing about physics-ish stuff (whatever I got excited about) or my favorite experiences (or ranting about something). At this point, I knew about Mathematica and its glories. Luckily, the Wolfram community had offered plenty of documentation for it, that I was able to learn its basics pretty quickly.

Now, I had no reason to use C for solving the problems, since Mathematica had all the built-in functions available in hand. I used it to generate plots and some simple animations, which I later embedded in my blog posts. One thing led to another and soon, I was making animations for teaching my classmates to visualize things.³

Along the way, I also got fond of a markup language - which affected me so much, that I got addicted to just seeing its beauty. It was the awesome LaTeX! I used to prepare notes for some of my courses with it, just because I felt happier to read all those silly formulas in LaTeX. Once I got to know about its clockworks, I began using LyX and Geogebra to speed things up.

Hands on an easy, aesthetic & powerful language…

Then, I discovered Github. I didn’t understand the point of it at first, and so I assumed it as a “dropbox for coders” - I didn’t know about the concept of version-control or open-source. Anyway, the repositories served nicely as a backup place for all my code.

I’m a perfectionist, alright? I admire the beauty of things when they’re neat, but I often struggle to maintain their clean state as time progresses. Since I realized that LaTeX seemed to consume most of my time, note-taking also came to an end. And, Mathematica (the only language I was currently working with) seemed too abstract, since it hides all the details from my view. Now, I wanted to know how things work behind the cloak, but I also don’t wanna fall back to C. (meh)

When I was a sophomore, I used to hang out a lot in our physics chatroom, the H-bar and most often, I could hear about Python. When I asked about it, they told me that the scientific community is using Python most of the time (from handling the data to plotting the results), especially because it’s easy. “Hmm, looks like something I should try out…“, and I got curious just like that.

By then, I somehow got interested in cryptography. After reading about the old Caesar & Vigenère ciphers, I got a desire to create my own cipher (that often happens if you get too excited about cryptography without thinking about the past few decades of research), and that little project of mine consumed my entire vacation at the end of my sophomore year.

On the brighter side, this was the point where an endless discovery was going on. As days passed, I dug deeper and deeper into Python. Some new idea kept popping up every now and then, and I immediately implemented it. Man, I was totally productive! As a free perk, I also translated the code to Javascript (by which I learned some HTML5 and JS along with a bit of CSS). This was also where my life as a coder began…

The day came soon, when I learned some stuff in public-key cryptography, when I realized about the depths of cryptography, when I’d also finally decided to ditch all my work because it just seemed too stupid to keep on developing a nonsensical beast which does nothing but consumes more time and memory in the name of a “non-conformant pet cipher”! I had to move on. By then, I knew some serious stuff in Python (thanks to Stackoverflow which helped me to learn all the way down to its internals), and so I used it to reduce the work in my academic stuff - like grabbing data from the lab machines, minor computations, iterations and plotting (which simply took too much time in Excel - some of my classmates realized that later).

Every time I encountered a problem requiring repetitive steps, I used Python. Since Python can do many things (given the vast amount of packages it has) - I used it to crawl through webpages to download stuff, switched to Python for solving those old Project Euler problems, clean my files (and sometimes, dirty code). Like I said, I’m a perfectionist.

After a few months, I got an idea (for another little project!). I often forget what happened every day (which is one of my problems which I had to solve) and so, I wanted something to keep track of my memories. That “something” was my next project - a diary. Apart from Mozilla, that project was something that consumed a great deal of time, and every single line in it was molded by me over the months. Ideas popped up in the same way - “Let’s have an option for searching!”, “Authentication per session would be a great idea!”, “Hmm, wouldn’t it be wonderful to have CBC before shifting?”, and I implemented those as and whenever they cropped up in mind.⁴

Into an awesome community - open-source and beyond…

Anyways, that thing had taken some of my time and since I was indulged in it, I never bothered to look into Project Euler, and so my problem-solving days were done (I simply didn’t get the mood!). Now, I needed something else - something I’ve never tried in my internet lifetime.

Then, I recalled Manish’s talks about Mozilla, Rust, etc. He had often told me that Mozilla’s one of the most welcoming communities to get involved. One of the most important things which Stack Exchange taught me is to “search more before asking questions!” Because, we often get silly questions at the site, where some users don’t even bother to put some effort on their questions. The point is that it was pretty hard for me to ask questions (especially when it’s a new place). Also, I had never heard of “IRC”.

I was afraid. The gigantic codebase and the sheer amount of code it contained freaked me out! I’ve never seen a codebase that big in my life! “Millions of lines of code? C’mon! Seriously? It’s just a browser!”. I never understood the bugs, nor did I know how to patch those. Soon, my vacation began, and my sole aim was to get involved in a Mozilla codebase! ⁵

I got into IRC and discovered that they offered mentors for new users. I got one for myself, who even suggested a “good first bug” to me, and it took me some days to finally submit a patch. That also required me to get into Ubuntu and a new version control system called Mercurial. My first bug showed me how things worked over there. It also came with a perk - I never used Windows from then on! I realized the awesomeness of Ubuntu! That was it, and later bugs became a piece of cake, thanks to those awesome Mozillians.

It was around that time that Rust 1.0.0 was released. I’d never gotten to play with a low-level language. The release of Rust and its success was my motivation. I got excited about it, and immediately got indulged in it (with the help of the wonderful docs & book). Now, an idea popped up while reading about concurrency and FFI in Rust (that was my first time reading about “FFI”, though I had some idea about concurrency). Anyways, that diary I’d written was completely in Python. So, if I could somehow link it through FFI and hand over the searching to Rust (and utilize its concurrency), then I could save a lot of time. And, it did.

This task consumed an entire weekend of mine (but, it was worth it!). Though FFI was hard, I got help from a lot of Rustaceans (mostly at IRC for minor things, or Stackoverflow, whenever things went out of hand). I loved Rust so much (just like I loved Python), which is also why I never felt hopeless, and by the end of the day, I got the thing to work! (finally)

Coding everyday…

There’s something that I learned from my journey. Once I got into the art of coding with the help of a language, jumping into other languages wasn’t a big deal. Moreover, if you get involved in a project (open-source, or something of your own), then you’ll more likely learn a great deal of stuff than you normally would (though that cipher and my diary weren’t very useful, I did learn a hell of a lot of things while working on them).

Also, since I’d gone for a high-level language (Python), I didn’t have to worry about the types very much, but I did get into trouble when I tried to get into a low-level beast like Rust. Because, Rust doesn’t abstract away many things like Python does, I had to do quite a few tasks by myself! Moreover, Rust’s design made my coding pretty challenging, because of its static type system and its merciless borrow checker (among other things). I had to pay the iron price! (for my choice of a high-level language at the start). But, I had quite a lot of fun while playing with it.

Nowadays, I’m involved in the Servo browser engine, to learn more about Rust, which gets pretty exciting every day. And, while I’m playing with Rust, I’ve also got back into C, because firstly, I’ve realized the power of these low-level languages (plus, C doesn’t have any abstractions at all!), and secondly, I’m about to finish my college and industries only know about established languages like C/C++/Java. So, I gotta learn one of those to get a job (sigh).⁶

Thanks to Manish for patiently reviewing this post…

If you’d grown in my locality (in your childhood), then all you know is that a computer (which you often see in school) is “something” where you can only type, draw, and play games - nothing more! I’m not embarrassed to say that we were never curious! ↩
Well, we learned that from a virus (and that too because we never believed in the concept of an anti-virus, since everything is just the cause & effect of code). This command is often utilized by a harmless worm which infects pen drives. ↩
It was for a course, where we draw vector diagrams for the velocities and accelerations of various parts of some machine. I’d even made some boring videos for the junior undergrads using those animations. ↩
It’s not a great project and all (I created that only to dig deeper into Python, and I have!). And yeah, I do write my story every day (I’ve got about a year of stories now). I don’t usually spend much time other than just recalling the events, and I just write a paragraph or two (like hints) - to just remind myself that something so nice/worse has happened on that day. ↩
Meanwhile, I got into a hackathon, when I learned a lot of stuff - more JS and some frameworks - especially the MVCs like Django and Angular JS, and I’d also figured out how Git worked. ↩
Well, to be honest with you, based on the things I’ve seen so far, I personally feel that C is quite easy, now that I’ve learned Rust. ↩