Today I’ll be continuing my series of posts on the rust implementation of the Mercurial version control system I’ve been working on. In this post I’ll be focusing on what I learned this week about the rust module system as well as a few helpful crates I discovered to aid in command-line argument parsing and error handling.
Since my last post I’ve landed on a name for my project that’s a bit nicer than
hg-rust. From now on this project will be known as rug. I’ve renamed the
repository on sr.ht and the code now lives at
https://hg.sr.ht/~ngoldbaum/rug. There should be redirects in place so the URLs
in my old posts will continue to work. I’d also like to come up with a
logo. Perhaps a rug with a crab on it that’s playing with a droplet of mercury?
Probably not healthy for poor Ferris…
As of my last post, all of the
code
lived in a single main.rs file that had grown to more than 200 lines of
code. Long modules like this can make it difficult to understand exactly how
everything interrelates. Following the rust
book
I decided to break out the code in my project into submodules organized
according to the logical structure of the existing code.
First, I moved the
code
that defines the various custom structs I wrote last week out of main.rs and
into a new revlogs module. At this point my main.rs file was much, much
simpler:

    use std::env;
    use std::fs::File;

    mod revlogs;

    fn main() -> std::io::Result<()> {
        let args: Vec<String> = env::args().collect();
        let fname = &args[1];
        let mut f = File::open(fname)?;
        let revlog = revlogs::Revlog::new(&mut f)?;
        println!("{}", revlog);
        Ok(())
    }

Before this change all of the code that defined the Revlog struct lived above
the definition of the main function. Now that code has been replaced with a
single line: mod revlogs. This line tells the rust compiler that there is
either a file named revlogs.rs or a file named revlogs/mod.rs. The latter
allows splitting out a module even further into submodules. The other
modification to the main function is the way I’m creating the Revlog
instance. Rather than being able to use the Revlog name directly, I need to
refer to it as revlogs::Revlog. I could have also said use revlogs::Revlog
above main to bring the Revlog struct into scope, but I prefer to avoid doing
that too much to make it clear where things are defined as I’m glancing at the
code.
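
To make the two ways of naming things concrete, here is a minimal sketch. It uses an inline mod block as a stand-in for a separate revlogs.rs file so that it fits in one snippet, and the struct body plus the count_with_path and count_with_use helpers are simplified placeholders invented for illustration, not the real code in rug:

```rust
// An inline module standing in for a separate `revlogs.rs` file.
// The struct body is a simplified placeholder, not the real Revlog.
mod revlogs {
    #[derive(Debug)]
    pub struct Revlog {
        entries: Vec<String>,
    }

    impl Revlog {
        pub fn new(entries: Vec<String>) -> Revlog {
            Revlog { entries }
        }

        pub fn len(&self) -> usize {
            self.entries.len()
        }
    }
}

// Option 1: refer to the type by its module-qualified path.
pub fn count_with_path(entries: Vec<String>) -> usize {
    let revlog = revlogs::Revlog::new(entries);
    revlog.len()
}

// Option 2: bring the name into scope first, then use it unqualified.
pub fn count_with_use(entries: Vec<String>) -> usize {
    use revlogs::Revlog;
    let revlog = Revlog::new(entries);
    revlog.len()
}
```

Both functions do the same thing. The only difference is whether the reader sees where Revlog comes from at the point of use.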
I also needed to make the Revlog struct public, along with the new method I
implemented on it to create new Revlog instances from a file stream, so the
struct definition now looks like:

    #[derive(Debug)]
    pub struct Revlog {
        inline: bool,
        generaldelta: bool,
        entries: Vec<RevlogEntry>,
    }

And the signature for the new method now begins with pub fn new instead of just fn new. I haven’t thought in detail about what should be public versus what the
compiler insists has to be public due to how I’m using these modules. I think
for a command-line application it doesn’t matter so much what my public API is
because no one will be consuming it, but for a library it’s probably
important. I will come back to these considerations later and see if I can
understand how to manage separation of concerns in rust in more detail.
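
As a starting point for thinking about what should be public, here is a small sketch of the visibility levels rust offers. The field layout, the is_inline accessor, and the describe_revlog helper are assumptions made up for this example; only the pub struct and pub fn new pattern comes from my code:

```rust
mod revlogs {
    pub struct Revlog {
        // Private: only code inside `revlogs` can read or set this field.
        inline: bool,
        // `pub(crate)`: visible anywhere in this crate, but not to other crates.
        pub(crate) generaldelta: bool,
    }

    impl Revlog {
        // Public constructor. It is needed because the private field stops
        // code outside the module from writing a `Revlog { .. }` literal.
        pub fn new(inline: bool, generaldelta: bool) -> Revlog {
            Revlog { inline, generaldelta }
        }

        // Public read-only accessor for the private field.
        pub fn is_inline(&self) -> bool {
            self.inline
        }
    }
}

pub fn describe_revlog(inline: bool, generaldelta: bool) -> String {
    let revlog = revlogs::Revlog::new(inline, generaldelta);
    // `revlog.inline` would not compile here; the accessor is the way in.
    format!(
        "inline={} generaldelta={}",
        revlog.is_inline(),
        revlog.generaldelta
    )
}
```

Note that making a struct pub does not make its fields pub, which is why the struct definition above can be public while its data stays hidden behind methods.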
Next I further split out the code for the Revlog struct into submodules for the Entry, Content,
and Header structs and then moved
the content and entry modules to be submodules of the entry submodule. Now
everything is nice and modular, each module is relatively short, and the code is
structured according to the logical structure of the data structure the code
describes. Nice!
Error handling in rust is still something that confuses me. It’s very different
from how error handling works in Python with exceptions. In rust, functions that
might fail return an enum called Result that wraps either a valid
return value or an error. One problem I have with this is that the errors in the
rust standard library do not contain context (e.g. a backtrace) unless you
explicitly add a context to the error. Any context associated with the error
needs to be present at the location where the error gets created. Call sites higher
up the call stack that might have more information that would be usable to
create a more helpful error message must consume the error and transform it into
a new error with the appropriate context, all completely manually. Finally,
rust’s static type system means that errors of one type are not necessarily
convertible to errors of another type, so one must either explicitly convert
errors from one type to another or manually define the conversion methods to and
from a custom error type to other error types. This leads to a proliferation of
boilerplate code for each error type.
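
To show what that boilerplate looks like, here is a minimal standard-library-only sketch. AppError and parse_bytes are hypothetical names invented for this example and are not part of rug:

```rust
use std::fmt;
use std::num::ParseIntError;
use std::str::Utf8Error;

// Hypothetical application error with one variant per underlying error type.
#[derive(Debug)]
pub enum AppError {
    BadUtf8(Utf8Error),
    BadNumber(ParseIntError),
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            AppError::BadUtf8(e) => write!(f, "input is not valid UTF-8: {}", e),
            AppError::BadNumber(e) => write!(f, "input is not a number: {}", e),
        }
    }
}

impl std::error::Error for AppError {}

// One `From` impl per source error type: this is the part that multiplies.
impl From<Utf8Error> for AppError {
    fn from(e: Utf8Error) -> AppError {
        AppError::BadUtf8(e)
    }
}

impl From<ParseIntError> for AppError {
    fn from(e: ParseIntError) -> AppError {
        AppError::BadNumber(e)
    }
}

// With the impls in place, `?` converts both error types automatically.
pub fn parse_bytes(raw: &[u8]) -> Result<i64, AppError> {
    let text = std::str::from_utf8(raw)?;
    let number = text.trim().parse::<i64>()?;
    Ok(number)
}
```

Two error sources already need a Display impl, an Error impl, and two From impls. Every new kind of failure adds another variant, another match arm, and another From impl.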
The rust error handling story is still somewhat in flux. For example, RFC
2504
describes an ongoing effort to rework the Error trait in the standard
library. In online discussions people might suggest using the
error-chain crate, the
failure crate, or suggest just
using the standard library Error type and having lots of boilerplate in code
to handle conversions. As of early Summer 2019, the consensus seems to have
moved to the snafu crate. From my
perspective, one of the main advantages of snafu over failure is that
snafu has much better documentation that contains clear usage examples. That’s
the main reason I chose to use it. A recent reddit
thread
summarizes the state of things in 2019. I’m hoping that in the next year or two
this situation will grow more clear.
The philosophy behind the snafu crate is to transform instances of errors
generated by standard library code or code outside of a developer’s control into
application-specific errors that are variants of a single enum that represents
the errors an application can produce. One defines an enum, which in my case I
called RugError, with variants that correspond to various kinds of errors:

    use snafu::{Backtrace, Snafu};
    use std::path::PathBuf;

    #[derive(Debug, Snafu)]
    enum RugError {
        #[snafu(display("rug must be run from inside a valid directory"))]
        NotAValidDirectory {
            backtrace: Backtrace,
            source: std::io::Error,
        },
        #[snafu(display("rug must be run from inside a repository"))]
        NotARepository { backtrace: Backtrace },
        #[snafu(display("The changelog file is not present in repository {}: {}",
                        path.display(), source))]
        NoChangelog {
            path: PathBuf,
            source: std::io::Error,
            backtrace: Backtrace,
        },
        #[snafu(display("The revlog file {} cannot be parsed: {}", path.display(), source))]
        CannotParseRevlog {
            path: PathBuf,
            source: std::io::Error,
            backtrace: Backtrace,
        },
    }

I’ve used the derive attribute to ask the compiler to generate a Snafu implementation
for my RugError enum. Each variant in the enum is given a snafu attribute, which allows
me to customize the error message based on context-specific data. Together these
attributes generate all of the error-conversion boilerplate that I would
otherwise need to write myself to allow instances of my error type to be created
from standard library errors.
Each error variant can optionally define a source and backtrace field. If
source is defined, it holds the underlying error. That means that the corresponding
variant can be created only from errors of the corresponding type. If one tries
to create an error from an incompatible error type that will lead to a type
mismatch and failed compilation. If source is not provided, the variant does not
wrap an underlying error. In my code that means one is creating an error from the
None variant of some Option.
If the backtrace field is defined, the error type generated by snafu will
contain a backtrace, and when the error is printed out in its Debug
representation the backtrace will also be printed. This is extremely
helpful if it isn’t clear where exactly an error of some type might be generated
in the code or if it isn’t clear how a piece of code is ultimately getting
called by the application. Finally there can also be optional fields that
contain metadata one can use to construct a nice error message. For example the
CannotParseRevlog variant in my RugError enum contains a path field that
represents the path to the revlog file that cannot be parsed. The error
message generated by CannotParseRevlog uses both the path and the source
field to generate the error message.
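
For comparison, here is roughly what a single variant with path and source fields costs when written by hand using only the standard library. This is a sketch of the idea and not the code that snafu actually generates. NoChangelogError and open_changelog are hypothetical names, and the backtrace field is left out:

```rust
use std::error::Error;
use std::fmt;
use std::fs::File;
use std::io;
use std::path::{Path, PathBuf};

// Hand-written stand-in for one variant with `path` and `source` fields.
#[derive(Debug)]
pub struct NoChangelogError {
    path: PathBuf,
    source: io::Error,
}

impl fmt::Display for NoChangelogError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(
            f,
            "The changelog file is not present in repository {}: {}",
            self.path.display(),
            self.source
        )
    }
}

impl Error for NoChangelogError {
    // Expose the underlying error so callers can walk the chain of causes.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

// Attach the path at the call site, which is where it is known.
pub fn open_changelog(path: &Path) -> Result<File, NoChangelogError> {
    File::open(path).map_err(|source| NoChangelogError {
        path: path.to_path_buf(),
        source,
    })
}
```

The metadata field (path) feeds the message, and the source field keeps the original io::Error reachable through Error::source. That is the same division of labor as in the RugError variants above.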
To make use of these errors, the snafu crate provides the ResultExt and
OptionExt traits, which extend the standard library Result and Option enums
with new methods that can transform errors at call sites. I made use of the
context method in a few places. For example, here is the function that
determines whether the current working directory is a mercurial repository:

    use snafu::{OptionExt, ResultExt};
    use std::path::PathBuf;

    fn hg_dir(current_dir: PathBuf) -> Result<PathBuf, RugError> {
        // Assumed definition of `anc`: iterate over current_dir and each of its parents.
        let mut anc = current_dir.ancestors();
        loop {
            let p = match anc.next() {
                Some(d) => d,
                None => break None,
            };
            let possible_hg_dir = p.join(".hg");
            if possible_hg_dir.is_dir() {
                break Some(possible_hg_dir);
            }
        }
        .context(NotARepository)
    }

This function takes the current working directory and returns a Result that can represent
either one of the custom errors I defined (a variant of the RugError enum) or
the path of the .hg directory in the root of the repository, represented by a
rust PathBuf object. The loop block evaluates to an anonymous Option (i.e. it’s
not bound to a variable name) that I call context on. I pass context the
NotARepository variant. The context function converts the None variant of
the Option into the NotARepository error. If the error ever bubbled back to
main it would get printed along with a backtrace because NotARepository has
a backtrace field. All of this happens automatically - this is the magic of
the snafu crate!
Side note - this uses a newish feature of rust - the break statement can
return values from inside a loop block. This feature was very handy here.
Without it I would have needed to create a function that did the loop and
explicitly returned an Option.
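
Here is a small standard-library-only sketch of the same pattern: a loop that produces a value through break, followed by a conversion of None into an error. The find_hg_dir name, the unit-struct error, and the is_repo_dir predicate are assumptions for illustration. The predicate stands in for a filesystem check such as is_dir so the example does not need to touch the disk, and ok_or plays the role that context plays in my real code:

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, PartialEq)]
pub struct NotARepository;

// Walk from `start` up through its ancestors and return the first `.hg`
// path accepted by `is_repo_dir` (a hypothetical stand-in for `Path::is_dir`).
pub fn find_hg_dir<F>(start: &Path, is_repo_dir: F) -> Result<PathBuf, NotARepository>
where
    F: Fn(&Path) -> bool,
{
    let mut ancestors = start.ancestors();
    // `loop` is an expression: whatever is passed to `break` becomes its value.
    let found: Option<PathBuf> = loop {
        let dir = match ancestors.next() {
            Some(d) => d,
            None => break None,
        };
        let candidate = dir.join(".hg");
        if is_repo_dir(&candidate) {
            break Some(candidate);
        }
    };
    // Plain standard-library conversion of `None` into an error.
    found.ok_or(NotARepository)
}
```

Every break inside one loop has to carry a value of the same type, which is why both exits wrap their result in an Option.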
I can also call context on a Result. For example, here’s the line where I
try to open the changelog file. If it isn’t present, I create a custom error
that includes the path to the file that is supposed to exist:

    let mut f = File::open(&fname).context(NoChangelog { path: &fname })?;

One downside of the snafu approach to error handling is that I need to be
careful to ensure standard library errors get converted into RugError
variants. In practice this means replacing usages of ? with
context(SomeError)?. This can definitely be more verbose; however, it also
forces me to think about the meaning of my code and what exactly each error
state really means. I’m hopeful that this will make debugging easier and lead to
fewer cases where I’m looking at opaque, poorly-described errors.

clap and structopt

Of course it’s possible to parse command line arguments fully manually by
consuming the iterator over arguments returned by the std::env::args function,
as described in the
book. This
works but requires a lot of wheel-reinventing to get common behaviors like
subcommands, positional arguments, optional arguments, and help output to work
properly. It makes sense to delegate that work to an external library.
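
To give a sense of the wheel-reinventing involved, here is a minimal sketch of manual parsing for a single log subcommand. The Parsed enum, the USAGE text, and parse_args are hypothetical and written only for this example. The function accepts any iterator of strings so that it can be fed std::env::args() in a real binary:

```rust
#[derive(Debug, PartialEq)]
pub enum Parsed {
    Log,
    Help,
}

pub const USAGE: &str = "USAGE:\n    rug <SUBCOMMAND>";

// The first item of `args` is the program name, as with `std::env::args()`.
pub fn parse_args<I>(args: I) -> Result<Parsed, String>
where
    I: IntoIterator<Item = String>,
{
    let mut args = args.into_iter().skip(1);
    let parsed = match args.next().as_deref() {
        // No arguments at all is treated like a request for help.
        None | Some("-h") | Some("--help") | Some("help") => Parsed::Help,
        Some("log") => Parsed::Log,
        Some(other) => {
            return Err(format!("unknown subcommand '{}'\n{}", other, USAGE));
        }
    };
    // Even rejecting stray arguments has to be written out by hand.
    if let Some(extra) = args.next() {
        return Err(format!("unexpected argument '{}'\n{}", extra, USAGE));
    }
    Ok(parsed)
}
```

In a binary this would be called as parse_args(std::env::args()). It already needs special cases for help and for unexpected input, and it does not yet handle flags, options with values, or per-subcommand help.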
My first attempt at this used the clap library. In my usage of clap I
generated the command line arguments for the rug log subcommand like this:

    use clap::{App, AppSettings, SubCommand};

    fn main() -> Result<(), RugError> {
        let matches = App::new("rug")
            .version("0.1")
            .author("Nathan Goldbaum")
            .about("A rust implementation of some hg functionality")
            .setting(AppSettings::ArgRequiredElseHelp)
            .subcommand(SubCommand::with_name("log"))
            .get_matches();
        match matches.subcommand_name() {
            // Propagate any error from hg_log (assumed to return Result<(), RugError>)
            Some("log") => hg_log()?,
            _ => panic!("should be unreachable!"),
        }
        Ok(())
    }

The name of the App corresponds to the name of the CLI binary. The version,
author, and about fields populate information in the help text for the
binary reported by rug --help. The setting usage tells clap to print the
help text if someone calls rug with no arguments. The subcommand call
creates a log subcommand that for now takes no arguments.
Finally, to initiate the control flow for the program, I match over the name of
the subcommand that a user supplied and then do the work of running rug log if
someone passes in log. Note that the default branch is marked as unreachable.
That’s because any other subcommand name will be caught by clap and result in an error
message reported to the user at the command line. Here’s a small command-line
session to see all of that in action:

    $ rug
    rug 0.1
    Nathan Goldbaum
    A rust implementation of some hg functionality

    USAGE:
        rug <SUBCOMMAND>

    FLAGS:
        -h, --help       Prints help information
        -V, --version    Prints version information

    SUBCOMMANDS:
        help    Prints this message or the help of the given subcommand(s)
        log

    $ rug notacommand
    error: Found argument 'notacommand' which wasn't expected, or isn't valid in this context

    USAGE:
        rug <SUBCOMMAND>

    For more information try --help

I also get colored output in the error case to visually highlight the important
parts of the error message to the user - the colored output doesn’t show up in
this post so don’t worry that you can’t see it here. I get all of this fancy
functionality more or less “for free” just by setting up clap. I like it!
One thing I don’t like is that I’m matching over strings. In general clap will
return strings to me that represent the values of command line
options. That will work but will be brittle. I also won’t be able to use the
ability of rust to check that I’m using all of the variants of an enum in a
match statement at compile time - so I might forget to implement a feature and
the compiler won’t alert me about it.
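
Here is a minimal sketch of the difference, using only the standard library. The Command enum, its Status variant, and describe_command are hypothetical and exist only to show the compile-time check:

```rust
use std::str::FromStr;

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Command {
    Log,
    Status, // hypothetical second subcommand, added for illustration
}

impl FromStr for Command {
    type Err = String;

    // The only place where strings are matched, so the catch-all arm lives here.
    fn from_str(name: &str) -> Result<Command, String> {
        match name {
            "log" => Ok(Command::Log),
            "status" => Ok(Command::Status),
            other => Err(format!("unknown subcommand '{}'", other)),
        }
    }
}

// No `_` arm: adding a variant to `Command` without handling it here
// is a compile error rather than a runtime surprise.
pub fn describe_command(command: Command) -> &'static str {
    match command {
        Command::Log => "show revision history",
        Command::Status => "show changed files",
    }
}
```

A match over strings always needs a catch-all arm, so a forgotten subcommand silently falls into it. A match over an enum can list every variant, and the compiler rejects the code if one is missing.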
This problem is solved by structopt, another crate that wraps clap and
allows one to define the command-line arguments and subcommands in terms of
enums or structs. Here is the equivalent structopt code to my usage of clap
above:

    use structopt::StructOpt;

    #[derive(StructOpt)]
    #[structopt(
        name = "rug",
        about = "A rust implementation of some hg functionality",
        author = "Nathan Goldbaum",
        version = "0.1",
        raw(setting = "structopt::clap::AppSettings::ArgRequiredElseHelp")
    )]
    enum Rug {
        #[structopt(name = "log")]
        Log {},
    }

    fn main() {
        match Rug::from_args() {
            Rug::Log {} => match hg_log() {
                Ok(_) => {}
                Err(e) => println!("{}", e),
            },
        }
    }

We define an enum whose variants represent all of the different
subcommands. Each subcommand can then in turn define arguments that it
accepts. In main I create an instance of the enum from the command-line
arguments and match over the result. Since the result will be one of the variants of the
enum and a match must cover every variant, I know that I’ve handled all possible
subcommands; otherwise the compiler would report an error.
At this point I’m pretty happy with the state of things. The only thing that bothers me about structopt (and more generally about code that uses rust’s attribute system) is that I’m programming inside of the attribute block, which feels a bit like writing code inside of a string: outside of normal control flow. My editor doesn’t highlight this code like normal code. The whole thing feels very magical. That said, I’m OK with the magic if it allows me to avoid a ton of boilerplate and make my code more maintainable.