Florian Leibert, CEO of Mesosphere, recently sat down with Stavros Korokithakis to talk about technologies that are being used to build the web today and tomorrow. Stavros is the founder of Stochastic Technologies and runs his own insightful and humorous blog, Stavros’ Stuff. He’s also active on Hacker News under his handle, StavrosK. Flo and Stavros talked about what makes Python and Django attractive for DevOps and data science. They also explored the InterPlanetary File System (IPFS), a technology rapidly becoming the foundation for the future distributed web.
Python in DevOps
Florian Leibert (FL): Python is used in DevOps with tools like Puppet and Chef, plus it’s also popular in the data science community. This puts Python at an interesting crossroads. What gets you super excited about Python?
Stavros Korokithakis (SK): I fell in love with Python years ago. Then I found Django and started creating web apps with it. My take is that Django is a great framework and Python is a great language. I love to use Python for web development, and web development is my core focus.
Matt Sarrel (MS): I’m curious, early in your career, what made you choose to focus on web development and languages like Python and Django?
SK: I like the web because it’s easy to deploy applications that are accessible from anywhere. It’s very easy to synchronize data between client and server. You can deploy applications to many, many people without them having to download anything, with zero hassle. This is kind of where containers helped a lot as well, because it used to be a big hassle to deploy my applications on the server without containers, but with Docker and other solutions, now it’s much, much easier and faster.
FL: Speaking of containers, how do containers make your day-to-day workflow and development process with Python easier?
SK: Packaging Python applications in containers helps greatly with managing dependencies and creating repeatable deployments. When a developer joins a company, for example, we can easily provide them with the exact same stack as we’re running in production on their local machine with just one command. This decreases friction and increases the speed with which the new developer can start writing code. It also makes sure that what they’re developing is the same as what will go into production. Developers frequently cut corners when building their local environments and, as a result, the code runs fine on their computer but breaks when it is deployed.
FL: This has some easy-to-see advantages. What do you think is the biggest hesitation companies have about adopting these new patterns?
SK: The biggest hesitation is that they’re reluctant to change their existing workflow. Another big hang-up is around security. When you deploy code directly on your operating system, you know that the server runs these packages, the operating system is that version, and everything works fine. You have a server that’s patched with the latest security updates. The containers may or may not have been patched, so you have to make sure that every container gets rebuilt with the updates applied and then restarted. You need to keep track of everything running in every container to do security right. The reality is that this isn’t that hard to master, but until companies get the hang of the process it can hold them back.
Desert Island Python Libraries
FL: What are the three to five Python libraries that you couldn’t live without?
SK: Hah, for what I do, three to five is not a very big number! I wrote shortuuid way back in the day and I still use it every day. It generates random UUIDs and also short strings that you can use for human readable IDs. Another one is schema, which validates any kind of input against the schema you’ve chosen. I use the Hypothesis library for testing a lot. It makes fuzzing and testing your code very easy. I use things like Flake8 to do static checks on code, PEP8 checks and things like that. And Django, of course, because that’s a library as well.
FL: Great. That’s awesome. Let’s talk about the library that you wrote, shortuuid. Why did you create it and what do you need all of these UUIDs for?
SK: First of all, whenever I create a database model, like a table in a database, and I need an ID, I never know when the table, or the data in the table, is going to be user-facing. I don’t want users to be able to guess how many of each record we have by creating a new one and seeing its ID. If you have numeric auto-incrementing IDs, the user can just create a new object and say, “Ah, these guys have 1,318 objects in the database.” UUIDs are not enumerable, so it’s much harder for somebody to gauge how many you have in the database. Also, it’s very helpful to be able to generate IDs that are more readable and writable than a standard UUID4. And with UUIDs, two servers can create IDs independently of each other and still avoid collisions.
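To make the idea of a shorter, more readable UUID concrete, here is a minimal sketch that re-encodes a random UUID4 in a larger alphabet with look-alike characters removed. It uses only the standard library and is not shortuuid's actual code. The alphabet and the function names `encode_uuid`, `decode_uuid`, and `new_short_id` are assumptions made for illustration.

```python
import uuid

# Assumed alphabet for this sketch: digits and letters with the
# easily confused characters 0, 1, I, O and l left out (57 symbols).
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def encode_uuid(u: uuid.UUID, alphabet: str = ALPHABET) -> str:
    """Re-encode the UUID's 128-bit integer value in the given alphabet."""
    base = len(alphabet)
    number = u.int
    digits = []
    while number:
        number, remainder = divmod(number, base)
        digits.append(alphabet[remainder])
    # An all-zero UUID would produce no digits, so fall back to the zero symbol.
    return "".join(reversed(digits)) or alphabet[0]


def decode_uuid(text: str, alphabet: str = ALPHABET) -> uuid.UUID:
    """Reverse encode_uuid and rebuild the original UUID."""
    base = len(alphabet)
    number = 0
    for char in text:
        number = number * base + alphabet.index(char)
    return uuid.UUID(int=number)


def new_short_id() -> str:
    """Generate a random, non-enumerable ID as a short string."""
    return encode_uuid(uuid.uuid4())
```

With a 57-symbol alphabet, a 128-bit value fits in at most 22 characters, compared with 36 for the usual hyphenated UUID form, and the encoding can be reversed to recover the original UUID.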
FL: How do you ensure that there are very few collisions?
SK: The library generates a UUID4, which doesn’t actually guarantee that there will be no collisions. But the space is so big that you’re more likely to win the lottery than get a collision. With the library, you can create one UUID and, if it already exists in the database, create another one. If that exists, create another one. Do that three times and you’re practically guaranteed that it won’t exist.
I also wrote a secondary function to generate cryptographically secure random data. It’s useful for generating cryptographic keys or things like that, as UUIDs themselves are not very good for cryptography. I use os.urandom() to get randomness from the operating system’s entropy source.
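The two techniques Stavros describes here, retrying on a collision and drawing random data from os.urandom(), could look roughly like the sketch below. It is an illustration under stated assumptions, not the library's implementation: `unique_id`, `secure_random_string`, the `existing_ids` set standing in for a database lookup, and the alphabet are all hypothetical.

```python
import os
import uuid

# Assumed alphabet for this sketch (57 unambiguous characters).
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def unique_id(existing_ids, max_attempts=3, make_id=lambda: uuid.uuid4().hex):
    """Return an ID that is not in existing_ids, retrying on a collision.

    existing_ids stands in for a database lookup; with UUID4 a collision
    is so unlikely that a handful of attempts is more than enough.
    """
    for _ in range(max_attempts):
        candidate = make_id()
        if candidate not in existing_ids:
            return candidate
    raise RuntimeError("could not generate a unique ID")


def secure_random_string(length, alphabet=ALPHABET):
    """Build a random string from the operating system's entropy source.

    Bytes at or above `limit` are discarded so that every character of
    the alphabet is equally likely (no modulo bias). Assumes the
    alphabet has at most 256 characters.
    """
    base = len(alphabet)
    limit = 256 - (256 % base)
    chars = []
    while len(chars) < length:
        for byte in os.urandom(length):
            if byte < limit and len(chars) < length:
                chars.append(alphabet[byte % base])
    return "".join(chars)
```

In a real application the membership test would usually be a database query or a unique constraint on the ID column, with the retry triggered by the constraint violation.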
Python in Data Science
FL: Let’s talk about Python in the data science community. There are a fair number of notebooks that are powered by Python, right?
SK: Yes. The IPython notebook kind of created the whole dynamic notebook movement. It became Jupyter, which runs many languages. Jupyter/IPython notebooks work really well for presentations or for running code live while you’re doing your presentation, showing the results live, and playing with them as you’re displaying them. That’s very much powered by Python, and the whole data science community uses those extensively.
FL: Are notebooks used for exploratory data analytics as well?
SK: Yes, exactly. It’s very helpful to be able to explain things in the notebook. You can intersperse text, graphics or whatever you want. Plus, you don’t lose state if you want to load some more code or experiment. I use them when I want to get some data out, play with it, prototype something, and then migrate it to a production setting. It’s a great process for exploratory programming.
IPFS and Excitement for the Distributed Web
FL: What other technology out there are you really excited about?
SK: I like IPFS and I’ve done a few things using it, just because I think it’s nice to have a distributed layer over the Internet where things aren’t centralized behind servers. You can access web pages, store them, and seed them out to other people, while at the same time with each access you prevent the content from getting lost or going stale. I also like that it creates a layer that will never go away with links that will never die.
IPFS is like a cross between Git and BitTorrent. It runs nodes that expose a file system. Files stored on IPFS are addressed by the hash of their contents. Later, when someone else puts the same file on IPFS, they will get the same hash as the file’s ID. Then you request the file by that ID and you will get it from any of the people who have it, whoever is closest to you and available. When you get the file, then you also share it. The more people that access the file, the more available it becomes and the more well-seeded it is. Since the file is always addressed by its hash, if you change the file, the old hash stays the same and still links to the old version, while the new version gets another hash. This is a great way of having immutable links on the internet and being able to publish your content without central servers.
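The content-addressing idea can be shown with a toy in-memory store, sketched below. This is not how IPFS is implemented: real IPFS identifiers are content identifiers built on multihashes rather than plain SHA-256 hex strings, and files are split into linked blocks. The `ContentStore` class and its methods are hypothetical names used only for this illustration.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the ID of a file is the hash of its bytes."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        """Store data and return its content ID (a SHA-256 hex digest here)."""
        content_id = hashlib.sha256(data).hexdigest()
        self._blocks[content_id] = data
        return content_id

    def has(self, content_id: str) -> bool:
        return content_id in self._blocks

    def get(self, content_id: str) -> bytes:
        """Fetch data by ID and verify that it still matches its hash."""
        data = self._blocks[content_id]
        if hashlib.sha256(data).hexdigest() != content_id:
            raise ValueError("stored data does not match its content ID")
        return data


# Two independent nodes that add the same bytes derive the same ID,
# so a reader can fetch the file from whichever node has it.
alice, bob = ContentStore(), ContentStore()
page_v1 = b"<h1>Hello, distributed web</h1>"
id_alice = alice.put(page_v1)
id_bob = bob.put(page_v1)

# Changing the content produces a new ID; the old ID keeps
# resolving to the old version.
page_v2 = b"<h1>Hello again, distributed web</h1>"
id_v2 = alice.put(page_v2)
```

Because the ID is derived from the bytes themselves, a reader can verify any copy it receives without trusting the node that served it, which is what makes fetching from arbitrary peers workable.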
FL: So there’s a huge scalability and availability play with this as well?
SK: Yes, exactly. They’re planning on having thousands or millions of open nodes being able to publish and retrieve content. Neocities, which is kind of a big GeoCities clone, publishes all their sites on IPFS. You can always access the old version of the site and the site is always available. It’s really nice.
FL: How do you think that IPFS is going to change the traditional way we think about stacks?
SK: Well, I hope it does. It’s a different thing. I think there are going to be use cases for IPFS and use cases for traditional web stacks. IPFS is great for when you need to make sure something can be served from multiple machines and you want to preserve versions. This gives us a whole new way to think about web servers. You wouldn’t run a full app in IPFS, but you’d certainly be able to do a lot of things in it.