I have been something like a “professional Python developer/programmer” for ~5 years. I write in whatever languages I need to use and make the given task easiest, but I have been paid some good money to write a decent amount of Python. During this time, I have designed, written, rewritten, tested, deployed, debugged, and deprecated production Python applications that do real work for real people in the real world. In this post, I describe a few qualities that make Python pretty neat.

Python makes for a fairly portable upgrade from Bash, especially for small (<=1000 line), single-file command-line interfaces/shell scripts. In many cases, Python compares favorably to medium-complexity Bash scripts while gaining some Windows portability. If, when writing a Bash script, one hears the call of serious work with iteration, complex conditional branching, arrays, non-string data types, stricter/more formal means of describing function arguments and return values, or containers/data structures + algorithms, consider a single Python script with no third-party packages/dependencies.

Here are some standard library modules and language-level design decisions that enable Python’s effectiveness in certain environments. All of the following are built into recent versions of Python. This post was inspired by the book JavaScript: The Good Parts, which aims to prove that such a book, though small, is indeed larger than a blank page.

  • pathlib Cross-platform, immutable/value-typed path objects with convenient methods such as Path.glob("**/*.parquet", case_sensitive=False) -> Iterator[Path] (case_sensitive requires Python 3.12+), predicates such as Path.is_dir() -> bool, Path.exists() -> bool, and Path.is_file() -> bool, and idempotent operations such as Path.mkdir(parents=True, exist_ok=True) and Path.touch(exist_ok=True).

    Augment this with shutil.copy, shutil.copytree, shutil.rmtree, shutil.move, and shutil.copyfileobj, plus the metadata-preserving shutil.copy2, and we are absolutely cooking! Organizing files is actually very important in my line of work, and many tasks become easier when one works with one’s filesystem of choice rather than against it. One can do some heavy lifting in environments where (almost) everything is a file, and Python has serious capabilities for dealing with files. Remote filesystems are trickier because the asynchronous processing story is…not as great, for reasons I don’t want to fully describe here but that boil down to “cross-platform task scheduling is difficult.”

    Frequently, the destinations for these file moves come from tempfile.TemporaryFile, tempfile.NamedTemporaryFile, and tempfile.TemporaryDirectory. Being cross-platform is the really nice part of the tempfile module, as different platforms have their own preferences about where to keep “workspace”/ephemeral files.
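
    As a hedged sketch of how the three modules compose (the helper name and the parquet-gathering task are hypothetical):

    from pathlib import Path
    from tempfile import TemporaryDirectory
    import shutil

    def collect_parquet_files(src: Path, dest: Path) -> None:
        """Hypothetical helper: gather every parquet file under src into dest,
        staging through a platform-appropriate temporary directory."""
        with TemporaryDirectory() as tmp:  # cleaned up on exit, even on error
            staging = Path(tmp)
            for p in src.glob("**/*.parquet"):
                shutil.copy2(p, staging / p.name)  # copy2 preserves metadata
            dest.mkdir(parents=True, exist_ok=True)  # idempotent
            for p in staging.iterdir():
                shutil.move(p, dest / p.name)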

  • datetime (with an honorable mention for time) If you actually need to deal with dates, that is a great time to put down Bash and pick up Python. Having something like “algebraic operations” (not in the mathematical sense, nerds) on datetime, timedelta, and date is too convenient. Seriously, once you throw in zoneinfo for time zones, we are cooking.

    from datetime import datetime, timedelta, timezone
    from zoneinfo import ZoneInfo
    from pathlib import Path
    
    # timezone is mainly useful for the 1-2 most common ones, and if you know the
    # exact offset from UTC. Otherwise looking up the system's timezone database
    # (with the option to fall back on a pypi package if the system does not keep
    # such a timezone database) is extremely nice.
    def one_hour_after_modification_in_nyc_time(path: Path) -> datetime:
        return (
            datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            + timedelta(hours=1)
        ).astimezone(ZoneInfo("America/New_York"))
    

    Try doing that in Bash!

  • collections deque and Counter nicely complement the built-in dict, list, set, and frozenset as my go-to/workhorse “dynamic data containers.” And while I reach for named tuples frequently, I typically define them via typing.NamedTuple rather than collections.namedtuple.
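
    A minimal sketch of how these tend to show up in my code (the Token record is made up for illustration):

    from collections import Counter, deque
    from typing import NamedTuple

    class Token(NamedTuple):  # a small, typed, immutable record
        text: str
        line: int

    assert Token("def", line=1).text == "def"

    # Counter: a multiset with conveniences like most_common()
    counts = Counter(word for word in "the quick the lazy the".split())
    assert counts.most_common(1) == [("the", 3)]

    # deque: O(1) appends/pops at both ends; maxlen gives a sliding window
    window: deque[int] = deque(maxlen=3)
    for i in range(5):
        window.append(i)
    assert list(window) == [2, 3, 4]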

  • Good support for processing common formats of structured data. I would include json, zipfile for zip archives, tarfile, csv for files of “delimited” rows, tomllib (read-only TOML parsing, added in 3.11), and gzip in this class of modules that makes Python useful for files we see “in the wild.”
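
    For instance, a hedged sketch of reading three such formats (the filenames are hypothetical):

    import csv
    import gzip
    import json
    import tomllib  # read-only TOML parsing, Python 3.11+
    from pathlib import Path

    # Hypothetical input files, purely for illustration.
    config = tomllib.loads(Path("config.toml").read_text())

    with gzip.open("events.jsonl.gz", "rt", encoding="utf-8") as fob:
        events = [json.loads(line) for line in fob]

    with open("report.csv", newline="", encoding="utf-8") as fob:
        rows = list(csv.DictReader(fob))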

  • typing Everything in Bash is a string. After a Bash script exceeds ~250 lines, you will just want things like integers, booleans, floats, composite/product types, tuples, etc. While some widely-used libraries, such as Pydantic and FastAPI, base runtime behavior on type annotations, many prefer to use them as extended documentation, along with a litmus test for overly complicated implementations. If type-annotating a module/package is too difficult, possibly involving extensive typing/generics masturbation, then your program might benefit from restructuring. Or it might be a skill issue.

    More often than not, a little type annotation goes a long way towards maintainability and effective static analysis.
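
    A small sketch of what I mean; nothing here is enforced at runtime (checkers like mypy or pyright do the enforcing), and the Interval record is made up for illustration:

    from typing import NamedTuple

    class Interval(NamedTuple):
        start: int
        end: int

    def overlaps(a: Interval, b: Interval) -> bool:
        """True when two closed intervals share at least one point."""
        return a.start <= b.end and b.start <= a.end

    assert overlaps(Interval(0, 5), Interval(3, 9))
    assert not overlaps(Interval(0, 1), Interval(2, 3))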

  • argparse Fine-enough argument parsing with a clear path to good error handling/reporting; the logging example below shows my typical setup.

  • logging The simplest way that I might introduce logging in a CLI is as follows: import logging, declare a global/module-scoped logger, invoke logging.basicConfig, and log away!

    #!/usr/bin/env python3
    """An example showing common features of the _many_ Python CLIs that I
    write."""
    
    from __future__ import annotations

    import logging
    import sys
    from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser, Namespace
    from collections.abc import Sequence
    from datetime import date, timedelta
    from pathlib import Path
    
    LOGGER = logging.getLogger(__file__)
    
    _VERBOSITY_TO_LOG_LEVEL: dict[int, int] = {
        0: logging.WARNING,  # No modular arithmetic masturbation for me...
        1: logging.INFO,
        2: logging.DEBUG,
    }
    
    
    class MyProgOpts(Namespace):
        verbosity: int
        start_date: date
        end_date: date
    
    
    def get_parser() -> ArgumentParser:
        parser = ArgumentParser(
            description=__doc__, formatter_class=ArgumentDefaultsHelpFormatter
        )
        parser.add_argument(
            "-v",
            "--verbose",
            action="count",
            default=0,
            help="Logging verbosity. Provide 0 to 2 times.",
            dest="verbosity",
        )
        parser.add_argument(
            "--start-date",
            type=date.fromisoformat,
            default=date.today() - timedelta(days=3),
            help="The first date (inclusive) of data to process"
        )
        parser.add_argument(
            "--end-date",
            type=date.fromisoformat,
            default=date.today(),
            help="The last date (inclusive) of data to process"
        )
        return parser
    
    
    def run(opts: MyProgOpts) -> int:
        logging.basicConfig(
            level=_VERBOSITY_TO_LOG_LEVEL.get(opts.verbosity, logging.WARNING)
        )
        LOGGER.info("Beginning sample program")
        LOGGER.debug("Beginning program with opts %s", opts)
        return 0
    
    
    def main(args: Sequence[str] | None = None) -> int:
        """Returns 0 if and only if we exit successfully"""
        parser = get_parser()
        opts: MyProgOpts = parser.parse_args(args, MyProgOpts())
        return run(opts)
    
    
    if __name__ == "__main__":
        sys.exit(main())
    

    The decision to make loggers (process-level) globals is convenient for most Python applications, but if you want to run many processes, you either write a new log file per process or push log records to a process-safe queue, likely backed by multiprocessing.Queue. The former option is typically easier and, depending on how you monitor application logs, may be exactly as simple as using one file.
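
    For the queue-based route, here is a minimal sketch using the standard logging.handlers.QueueHandler and QueueListener pair (the worker body and the bare StreamHandler are illustrative choices, not prescriptions):

    import logging
    import logging.handlers
    import multiprocessing

    def worker(log_queue) -> None:
        # Workers push LogRecords onto the queue instead of touching files.
        logging.basicConfig(
            level=logging.INFO,
            handlers=[logging.handlers.QueueHandler(log_queue)],
        )
        logging.getLogger(__name__).info(
            "hello from %s", multiprocessing.current_process().name
        )

    if __name__ == "__main__":
        log_queue = multiprocessing.Queue()
        # One listener in the parent drains the queue into a real handler.
        listener = logging.handlers.QueueListener(log_queue, logging.StreamHandler())
        listener.start()
        procs = [
            multiprocessing.Process(target=worker, args=(log_queue,))
            for _ in range(2)
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        listener.stop()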

  • sqlite3 This serves, in part, as a reference implementation for the DB-API 2.0 (PEP 249) synchronous database interface. All the homies like SQLite. Single-file local databases that permit concurrent reads (with serialized writes) and speak a large subset of standard SQL are extremely convenient. Thanks largely to sqlite3, you almost never need the struct module or a hand-rolled binary file format whose only interface is fopen.
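
    A minimal sketch of the DB-API flow (the schema is made up):

    import sqlite3

    # connect, execute, fetch; the context manager scopes a transaction.
    with sqlite3.connect(":memory:") as conn:
        conn.execute("CREATE TABLE runs (name TEXT, seconds REAL)")
        conn.executemany(
            "INSERT INTO runs VALUES (?, ?)",
            [("etl", 12.5), ("report", 3.2)],
        )
        for name, seconds in conn.execute(
            "SELECT name, seconds FROM runs ORDER BY seconds"
        ):
            print(name, seconds)

    # Note: the context manager commits (or rolls back) the transaction but
    # does not close the connection; do that explicitly.
    conn.close()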

  • Context managers are Python’s way of constraining/scoping external resources.

    Consider the following.

    from contextlib import chdir  # chdir was added in Python 3.11
    from pathlib import Path
    
    def main():
        # Scoped control of the `PWD` environment variable
        # Enter == enter new directory + save current dir
        with chdir(Path.cwd().root):
            print(Path.cwd())  # Exit = back to original dir
    
        with open("a.txt", "w") as fob: # Enter == open file descriptor
            fob.write("Hello\n")
            # Exit == close the file so other processes can open it
    

    It is a nice way to do “setup + teardown” for some computational context, which is also useful for database transactions and test fixtures. And writing your own is trivial, as sketched below.
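
    Here is a hedged sketch using contextlib.contextmanager, with a hypothetical settings dict standing in for a real resource:

    from collections.abc import Iterator
    from contextlib import contextmanager

    @contextmanager
    def temporary_setting(settings: dict, key: str, value: object) -> Iterator[None]:
        """Setup/teardown in one function: override a setting, then restore it."""
        missing = object()
        old = settings.get(key, missing)
        settings[key] = value          # setup runs before the with-body
        try:
            yield                      # the with-body executes here
        finally:
            if old is missing:         # teardown runs even on exceptions
                del settings[key]
            else:
                settings[key] = old

    settings = {"retries": 3}
    with temporary_setting(settings, "retries", 0):
        assert settings["retries"] == 0
    assert settings["retries"] == 3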

  • Simple data parallelism. The fact that we can take functional code and easily parallelize at least the map step is pretty cool.

    from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
    from functools import partial

    def cpu_bound_func(val: int, param: int) -> int:
        return val * param  # stand-in for real number crunching

    io_bound_func = cpu_bound_func  # stand-in for e.g. a blocking HTTP call

    iterable = range(8)

    if __name__ == "__main__":  # the guard matters for process pools on "spawn" platforms
        with ProcessPoolExecutor(4) as p:
            res1 = list(p.map(partial(cpu_bound_func, param=10), iterable))

        with ThreadPoolExecutor(4) as p:
            res2 = list(p.map(partial(io_bound_func, param=15), iterable))
    

    I like calling these pools p because it makes my parallelism come from just 2 characters (plus 1 line to scope the thread/process pool)! And it aligns with Clojure’s pmap .

  • Some helpful algorithm modules: bisect for binary searches over sorted sequences and heapq for heaps/priority queues.
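
    A quick taste of both:

    import bisect
    import heapq

    # bisect: O(log n) searches over an already-sorted list
    deadlines = [5, 10, 20, 40]
    assert bisect.bisect_left(deadlines, 20) == 2

    # heapq: treat a plain list as a min-heap / priority queue
    tasks: list[tuple[int, str]] = []
    heapq.heappush(tasks, (2, "report"))
    heapq.heappush(tasks, (1, "etl"))
    assert heapq.heappop(tasks) == (1, "etl")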

  • The functional programming modules operator, functools, and itertools are…decent. But if you picked just one to peruse, I would absolutely make it itertools. It is definitely a good idea to familiarize oneself with at least itertools.groupby(iterable, key) -> Iterator[tuple[K, Iterator[T]]] (which groups consecutive elements, so sort by the key first) and itertools.batched (added in 3.12). It is an absolute travesty to have these at one’s fingertips and never even know that they exist. They come up very often. Maybe 60-70% of all programs I have written have been “iteration and selection,” or basically “for-loops and if-statements.” itertools helps with half of that, as demonstrated below.
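
    A short demonstration (records is a made-up dataset):

    from itertools import batched, groupby  # batched requires Python 3.12+

    records = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]

    # groupby groups *consecutive* runs, so sort by the key first.
    by_letter = sorted(records, key=lambda r: r[0])
    for letter, group in groupby(by_letter, key=lambda r: r[0]):
        print(letter, [val for _, val in group])
    # a [1, 3, 4]
    # b [2]

    # batched yields fixed-size tuples; the last one may be short.
    assert list(batched(range(7), 3)) == [(0, 1, 2), (3, 4, 5), (6,)]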

A Parting Treat

I am not a Python shill. Python is not great for everything. Slow iteration is a rough issue, though most of my hot loops end up delegated to code written in C and Rust; I will come back to dataframes later. For now, I will give you two funny parts about Python.

A List that contains itself

Python 3.12.4 (main, Jun  7 2024, 00:00:00) [GCC 14.1.1 20240607 (Red Hat 14.1.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> x = []
>>> x.append(x)

x now contains itself. What do you think x looks like now? My guess would be [[]], a list containing one empty list.

>>> x
[[...]]

Huh? What does that mean? Let’s investigate.

>>> x[0]
[[...]]
>>> x[0] == x
True
>>> x[0][0][0][0][0][0][0][0][0] == x
True
>>> 

x contains one element: a pointer to x.

Final Parting Gift

Python 2.7.15+ (default, Oct  7 2019, 17:39:04) 
[GCC 7.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import string
>>> string.letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> help(string)

>>> string.letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

I’m just going to leave this here and not explain anything. Anyway, Python has (had) its troublesome corners. But if one stays in green pastures, one can actually get stuff done, go home, plot the destruction of the American empire, read a book, and have a cup of tea.

Conclusion

Python is pretty nice outside of performance-sensitive contexts, and it is also good at “gluing together” different programs that are better equipped to tackle those performance bottlenecks. The strong standard library makes Python a useful front-end choice for, for example, Apache Airflow DAG definitions. If you find a Bash script becoming too complicated, then maybe give Python a try.

I will write later about how I prefer to distribute Python applications that have relatively simple dependencies.