Subtitle file exercise, translating Python to Rust
22 Feb 2023
I love the movie Crouching Tiger Hidden Dragon (CTHD), especially the bar scene. I tried to find the screenplay for the movie in Mandarin. I found an awesome GitHub project (subtitles-combiner) and started thinking…
subtitles-combiner is great because its Git repository includes the subtitle files (SRT files) for CTHD in many different languages, including Simplified Mandarin, Traditional Mandarin, Pinyin and English. The subtitles are a treasure because it’s hard to find the CTHD screenplay in Mandarin; maybe we can create the transcript from the subtitles!
The subtitles-combiner Git repo includes a Python program which combines multiple subtitle files into one subtitle file. For example, it combines Simplified Mandarin subtitles and Pinyin subtitles into Mandarin+Pinyin subtitles.
subtitles-combiner is written in Python, so I can understand it. Specifically, it uses Python 2, so translating the project to Python 3 would be good practice, and translating it to Rust would be a challenging exercise.
Converting to Rust
Creating the rust-srt-combiner
Before I began I installed Rust. I’ve been using the Rust Book experiment. The Rust book and most resources teach you to use rustup to manage your Rust installation.
The first real step in converting to Rust is to create a Rust project. The Hello, Cargo! chapter teaches us how to create a new project using cargo new. Let’s name our project rust-srt-combiner:
$ cargo new rust-srt-combiner
Created binary (application) `rust-srt-combiner` package
Our new cargo project comes with a Cargo.toml file. The *.toml format is easy to read, but we should add a description property to give credit where credit’s due. Notice our description property points to the original subtitles-combiner git repo:
[package]
name = "rust-srt-combiner"
version = "0.1.0"
edition = "2021"
# the description we added vvvvvvvvvvvvvvvvvvvvvvvvvvv
description = "A Rust translation of this project https://github.com/gterzian/Subtitles-combiner"
[dependencies]
Translating combine.py
Next we could create the Rust equivalent of a “Python package” (note subtitles-combiner has a Python __init__.py file), or we could focus on the core logic in the combine.py file. The combine.py file starts by creating an ArgumentParser so subtitles-combiner can be run from the command line. We’ll ignore “packaging” and argument-parsing for now – we can run our Rust program with cargo run and hard-code any arguments. Let’s focus on the most important part: subtitle file parsing.
subtitles-combiner defines these four functions
- read_files
- read_lines
- combine
- write_combined_file
We’ll translate these functions in the order they are executed.
Translating read_lines
read_lines takes a single file, opens the file in read mode (rt) and yields a result for every nonempty line – the yield means read_lines is a generator function. In fact the subtitles-combiner project is described as “an example of using generators for creating data processing pipelines”, and links to this presentation on Python Generator Hacking.
How do we translate this Python function to Rust?
def read_lines(sub_file):
    with open(sub_file, 'rt') as f:
        for line in f:
            striped = line.strip()
            if striped:
                yield striped.decode('utf-8')
Filepaths are more than strings
I thought Python’s open(file...) function accepted only a string argument containing the filepath, but it accepts a “path-like object”. So does Rust’s File::open(path...); the argument path uses the generic type P, where <P: AsRef<Path>> – notice the use of the Path struct, which Rust describes as “a slice of a path (akin to str)”. Rust accepts a simple String because String implements the AsRef<Path> trait. A file “path” is more than just a simple string in both Python and Rust because a file path depends on the operating system. Linux and Windows use different path separators (/ vs \), so a single hard-coded string can use only one separator and may work only on Linux or only on Windows. A “path”-like object, by contrast, can be cross-platform.
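To make the AsRef<Path> point concrete, here is a minimal sketch. The helper names open_subtitles and subtitle_path and the file names are hypothetical (they are not part of subtitles-combiner); the only standard-library calls are File::open and Path::join.

```rust
use std::fs::File;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical helper: accepts anything "path-like" (&str, String, &Path, PathBuf).
fn open_subtitles<P: AsRef<Path>>(path: P) -> io::Result<File> {
    File::open(path)
}

// Hypothetical helper: joins components with the platform's separator,
// so no "/" or "\" is hard-coded in our own strings.
fn subtitle_path(dir: &str, file_name: &str) -> PathBuf {
    Path::new(dir).join(file_name)
}
```

Calling open_subtitles("cthd.srt"), open_subtitles(String::from("cthd.srt")) and open_subtitles(subtitle_path("subs", "cthd.srt")) should all compile, because each of those argument types implements AsRef<Path>.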
Close the file when you’re done
Python’s with statement “wraps the execution of a block with methods defined by a context manager”. That means Python will always call __exit__() on your context manager when the block ends. When the context manager is a file, Python will close the file when you’re done (when the with block ends). That’s because a Python file inherits from the IOBase class, which implements the __exit__ method. Rust closes the file using the file’s drop function. In Rust the drop method is comparable to __exit__(). The drop function is known as the object’s destructor and it “gives the type time to somehow finish what it was doing”. Rust’s destructor is called automatically “when an initialized variable or temporary goes out of scope”.
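Here is a small sketch of that scope-based cleanup. NoisyHandle is a hypothetical stand-in for a file handle (a real std::fs::File closes itself on drop without any code from us); it just records when its drop method runs so we can see the order of events.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical stand-in for a file handle that records when it is dropped.
struct NoisyHandle {
    name: &'static str,
    log: Rc<RefCell<Vec<String>>>,
}

impl Drop for NoisyHandle {
    // Rust calls this automatically when the value goes out of scope.
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("closed {}", self.name));
    }
}

fn run_demo() -> Vec<String> {
    let log: Rc<RefCell<Vec<String>>> = Rc::new(RefCell::new(Vec::new()));
    {
        let _handle = NoisyHandle { name: "cthd.srt", log: Rc::clone(&log) };
        log.borrow_mut().push(String::from("reading"));
    } // `_handle` goes out of scope here, so `drop` runs (like leaving a `with` block)
    log.borrow_mut().push(String::from("after the block"));
    let events = log.borrow().clone();
    events
}
```

The inner braces play the role of Python’s with block: the “closed” entry is recorded at the closing brace, before anything after the block runs.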
Taking it one line at a time
The Python file object inherits from IOBase, which means the file object is a context manager and also that the file can be “iterated over yielding the lines in a stream”. The Rust book teaches us how to read the entire file into a string using fs::read_to_string(file_path)..., but we don’t want one string – we want to iterate over each line in one subtitle file (so we can combine it with lines in another subtitle file). Rust’s std::io::BufReader and the .lines() method give us a way to iterate over lines. You could also make a custom BufReader which implements Iterator so you can iterate over the BufReader directly. Not only is this closer to the Python approach (for line in my_reader), but it also means you don’t have to “allocate a string for each line”.
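A rough sketch of both ideas, using only the standard library. The function names (open_buffered, collect_lines, for_each_line) are my own, and for_each_line is just one way to avoid a fresh String per line: it reuses a single buffer with read_line instead of building a custom BufReader.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

// Wraps a file in a BufReader so it can be read line by line.
fn open_buffered<P: AsRef<Path>>(path: P) -> io::Result<BufReader<File>> {
    Ok(BufReader::new(File::open(path)?))
}

// `lines()` yields io::Result<String>, allocating a new String per line.
fn collect_lines<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    reader.lines().collect()
}

// Reuses one String buffer for every line and hands each line
// to the callback `f` as a borrowed &str.
fn for_each_line<R: BufRead, F: FnMut(&str)>(mut reader: R, mut f: F) -> io::Result<()> {
    let mut buf = String::new();
    loop {
        buf.clear();
        if reader.read_line(&mut buf)? == 0 {
            break; // 0 bytes read means end of input
        }
        // read_line keeps the line ending, so strip it before calling `f`.
        f(buf.trim_end_matches(|c: char| c == '\n' || c == '\r'));
    }
    Ok(())
}
```

Both functions are generic over BufRead, so they accept a BufReader<File> from open_buffered as well as in-memory bytes such as "one\ntwo".as_bytes(), which is handy for trying things out.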
Truthy and falsey
Python supports the idea of “truthy” and “falsey”, which means the body of if line: will not execute if line is an empty string. An empty Python string is considered false or “falsey” – so is any empty Python sequence/collection. Rust if statements don’t work like Python’s; if the condition isn’t a bool, you’ll get an error. “Unlike languages such as Ruby and JavaScript, Rust will not automatically try to convert non-Boolean types to a Boolean.” Rust prefers explicitness, so instead of relying on a language feature to treat an empty sequence as false (aka “falsey”), we should explicitly check whether the string is empty (its length is 0).
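In Rust that explicit check might look like the sketch below. keep_line is a hypothetical helper name; str::trim and str::is_empty are the standard-library methods doing the work.

```rust
// Hypothetical helper mirroring Python's `striped = line.strip()` / `if striped:`.
fn keep_line(line: &str) -> Option<&str> {
    let stripped = line.trim();
    // `if stripped { ... }` would not compile: the condition must be a bool.
    if !stripped.is_empty() {
        Some(stripped)
    } else {
        None
    }
}
```

Returning an Option makes the “nothing to yield” case explicit too, instead of relying on an empty string meaning false.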
Decoding from Unicode
The Python read_lines function
TODO Compare Python 2 and Python 3
Too lazy to be lazy
The file object is iterable, but the read_lines Python function uses the yield keyword. In Python the yield keyword creates a generator. Generators can be iterated (because generators “implement the iterator protocol”). So read_lines is a “generator function”; calling read_lines immediately returns a generator-iterator. The code in the generator function “only runs when called by next(g) or g.send(v), and execution is suspended when yield is encountered”. When should you use a generator? Use a generator when “you don’t know if you are going to need all results, or where you don’t want to allocate the memory for all results at the same time.” So maybe we only want to translate the first 5 lines of dialog in our .srt files – no need to read the entire file for that!
Does Rust have generators? Using the yield keyword in Rust requires experimental/unstable features: #![feature(generators, generator_trait)]. Instead of using the experimental Rust yield keyword, you could explicitly implement the Iterator trait (i.e. define the fn next(&mut self) function). Our Rust function already returns Lines<BufReader<File>>, which implements Iterator, so I won’t implement Iterator myself (it’s already implemented), and I won’t try to use the yield keyword in Rust (too advanced for me!)
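For the curious, here is roughly what explicitly implementing Iterator could look like. This is only a sketch of the option I’m skipping: NonEmptyLines is a hypothetical type that wraps the standard Lines iterator and skips blank lines, much like the if striped: yield ... part of the Python generator.

```rust
use std::io::{self, BufRead, Lines};

// Hypothetical iterator: wraps `Lines` and skips blank lines.
struct NonEmptyLines<R: BufRead> {
    inner: Lines<R>,
}

impl<R: BufRead> NonEmptyLines<R> {
    fn new(reader: R) -> Self {
        NonEmptyLines { inner: reader.lines() }
    }
}

impl<R: BufRead> Iterator for NonEmptyLines<R> {
    type Item = io::Result<String>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // `?` on an Option: stop iterating when the input is exhausted.
            match self.inner.next()? {
                Ok(line) => {
                    let stripped = line.trim();
                    if !stripped.is_empty() {
                        return Some(Ok(stripped.to_string()));
                    }
                    // blank line: loop around and try the next one
                }
                Err(e) => return Some(Err(e)),
            }
        }
    }
}
```

Because it implements Iterator, a NonEmptyLines value works in a for loop and with adapters such as .take(5), without any unstable features.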
Instead I’ll consider these Python alternatives to the existing read_lines “generator function”:
# Python generator function
def read_lines(sub_file):
    with open(sub_file, 'rt') as f:
        for line in f:
            striped = line.strip()
            if striped:
                yield striped.decode('utf-8')
Generator expression
Here is a Python read_lines function that returns a “generator expression”. This SO post discusses the difference between a generator expression and a generator function. All the answers agree you should use whichever approach is clearer / more “readable”. What do you think?
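# Python function that returns a generator expression
# (each line is stripped so that blank lines are dropped, as in the generator function)
def read_lines(sub_file):
    return (line.strip().decode('utf-8') for line in open(sub_file, 'rt') if line.strip())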
Filter and map functions
Here is a Python read_lines function that uses the filter(...) and map(...) functions. In Javascript I always go to the .map() and .filter() methods when I need to process some data. For me, seeing the words “filter” and “map” sends a clear message about the purpose of the code. I especially like how Javascript syntax lets us write .filter() first and then .map() so you read the methods in the order they are executed (a technique I know as “chaining”, related to the idea of “piping”). Also in Javascript each method can be put on a new line, which enhances readability. Python doesn’t let us do that; the filter appears inside the map, and I could use newlines but it is awkward.
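# Python function that uses map and filter
# (lines are stripped so that blank lines are dropped, as in the generator function)
def read_lines(sub_file):
    return map(lambda l: l.strip().decode('utf-8'), filter(lambda l: l.strip(), open(sub_file, 'rt')))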
Compare this to the Javascript-like function below. What I love about it:
- the code is executed in the order it’s written (i.e. file is opened, then filtered, then mapped)
- each new step in processing is on a new line (and the processing step .filter, .map is the first word on the line)
- indentation is not required (so the code doesn’t get very “wide”, I can read it top-to-bottom instead of left-to-right)
// Javascript-like function that uses map and filter
def read_lines(sub_file):
    return file
        .open(sub_file)
        .filter(_=>_)
        .map(_=>_.decode('utf-8'));
So the Python tendency for code to collapse onto a single line makes our generator expression and filter/map functions less readable – I think the original generator function (using the yield keyword) is best. These considerations should influence our Rust translation. The Rust book talks about the concept of readability:
However, one long line is difficult to read, so it’s best to divide it. It’s often wise to introduce a newline and other whitespace to help break up long lines when you call a method with the .method_name() syntax.
So let’s revisit our function. With the help of questions like [“most efficient way to filter Lines<BufReader
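Here is one possible shape for a Rust read_lines, written as a method chain with one step per line as the Rust book suggests. Treat it as a sketch under my own assumptions, not the final translation: the helper non_empty_lines is a hypothetical name, and I trim in .map() before .filter() so the emptiness check sees the stripped text.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

// Hypothetical helper: lazily yields each non-empty, trimmed line.
fn non_empty_lines<R: BufRead>(reader: R) -> impl Iterator<Item = io::Result<String>> {
    reader
        .lines()
        .map(|line| line.map(|l| l.trim().to_string()))
        .filter(|line| match line {
            Ok(l) => !l.is_empty(),
            Err(_) => true, // keep errors so the caller can see them
        })
}

// Opens the subtitle file and returns the lazy line iterator.
fn read_lines<P: AsRef<Path>>(path: P) -> io::Result<impl Iterator<Item = io::Result<String>>> {
    let file = File::open(path)?;
    Ok(non_empty_lines(BufReader::new(file)))
}
```

Like the Python generator, nothing is read until the caller starts iterating, and the file is closed when the iterator (which owns the BufReader) is dropped.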
Translating read_files
…TODO…
Translating combine
…TODO…
Zipping it up
Translating write_combined_file
…TODO…
Important Differences
- String vs str
  String is the dynamic heap string type, like Vec: use it when you need to own or modify your string data. str is an immutable sequence of UTF-8 bytes of dynamic length somewhere in memory. Since the size is unknown, one can only handle it behind a pointer. This means that str most commonly appears as &str
- Reader vs BufReader
  A BufReader<R> performs large, infrequent reads on the underlying Read and maintains an in-memory buffer of the results… BufReader<R> can improve the speed of programs that make small and repeated read calls to the same file or network socket. It does not help when reading very large amounts at once, or reading just one or a few times
- BufReader vs BufRead
  BufRead is a trait, whereas BufReader is a struct. Traits are similar to interfaces in other languages; Rust traits define shared behavior. The BufReader struct implements the BufRead trait
- struct vs type
  A struct is a type – “type” is the more general category; struct is one kind of type.
- .unwrap() vs ?
  Rust deals with the Result object
- a “lambda” vs a “closure” generally
  - Do the general ideas apply to a Python lambda vs Rust closure
- Python __exit__() vs Rust drop()
- “chaining” methods vs “piping” outputs as inputs
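A few of the differences listed above are easier to see in code. The sketch below uses hypothetical function names (shout, parse_index, parse_index_or_panic, add_suffix); the parsing example leans on the fact that each SRT entry starts with a numeric index line.

```rust
use std::num::ParseIntError;

// String vs str: borrow a &str (works for literals and for &String),
// return an owned String when we need to build or modify text.
fn shout(text: &str) -> String {
    let mut owned = String::from(text);
    owned.push('!');
    owned
}

// `?` returns the error to the caller instead of panicking.
fn parse_index(line: &str) -> Result<u32, ParseIntError> {
    let n = line.trim().parse::<u32>()?;
    Ok(n)
}

// `.unwrap()` panics if the Result is an Err.
fn parse_index_or_panic(line: &str) -> u32 {
    line.trim().parse::<u32>().unwrap()
}

// A closure can capture variables from its environment (here `suffix`).
fn add_suffix(lines: &[&str], suffix: &str) -> Vec<String> {
    lines.iter().map(|line| format!("{}{}", line, suffix)).collect()
}
```

In practice ? tends to suit library-style functions like read_lines, while .unwrap() is fine for quick experiments where a panic is an acceptable way to fail.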