Refactoring and Finding the Right Data Structure

Recently, before implementing a new feature in pandoc-include-code, a Pandoc filter I have written, I needed to make some changes to the code. The program allowed what should’ve been considered conflicting modes of operation. In this episode I’ll perform the same modification, to show you how the right data structure can guide and improve your implementation.

Show Notes

The purpose of the filter is to allow code snippets to be included from external files, instead of copy-pasting snippets into the Markdown document. By specifying the source code language, and with the include attribute specifying a file path, the filter will populate the code block with the contents of that file. By using startLine and endLine, you can control the range of lines included from the file.

# My Project

```{.haskell include=Example.hs startLine=1 endLine=6}
```

Another way of including a smaller part of a file is to specify a named snippet. This expects you to have two lines in the included file, usually source code comments, specifying the start and end of the snippet.

## Details About Stuff

```{.haskell include=Example.hs snippet=main}
```

Let’s have a look at the Example.hs file. Lines 1 to 6 are included by the range code block, and 9 to 11 by the snippet code block. These two comments mark the start and end of the “main” snippet.

module Example where

myExample = do
  putStrLn "Hello"
  putStrLn "..."
  putStrLn "world."

-- start snippet main
main = do
  myExample
  putStrLn "Bye!"
-- end snippet main

Rendering this Markdown document to HTML, you can get something like the following in a web browser.

In the feature request I got, a user asked for the ability to combine the .numberLines class in Pandoc with pandoc-include-code and its named snippets, to have included snippets show line numbers starting from their original line number in the source file.

The following HTML shows the expected output. Notice how it starts at 9, not 1.

When I started on adding support for .numberLines, I quickly realized a flaw in the implementation of pandoc-include-code. It had three modes of operation:

including an entire file,
including a line range from a file, and
including a named snippet from a file.

The first case was all right, but the second two had some overlap; there was nothing stopping a user from specifying both snippet and startLine/endLine.

```{.haskell include=Example.hs startLine=1 endLine=6 snippet=main}
```

Using both in combination did not make any sense to me, and it made the feature request much harder to implement, so I decided to first do some refactoring, making the implementation explicitly state the modes of operation, and making them mutually exclusive.

Let’s head back to the revision before my refactoring, and see where we end up.

Introducing the InclusionMode Data Type

The InclusionSpec data type represents the work to be done by the filter; it carries the file to include, the optional snippet name, the optional range of lines, and also a number of columns to dedent. We won’t touch dedent at all, so you can ignore it.

data InclusionSpec = InclusionSpec
  { include :: FilePath
  , snippet :: Maybe Text
  , range   :: Maybe Range
  , dedent  :: Maybe Int
  }

Remember how I said the modes of operation should be mutually exclusive? If you look at this data type, and specifically at the snippet and range fields, we are not representating that design decision in a clear way. By using two optional modes, we do allow for them both to be present. Let’s cut them out, and make the mode explicit using a data type.

data InclusionSpec = InclusionSpec
  { include :: FilePath
  , mode    :: InclusionMode
  , dedent  :: Maybe Int
  }

The InclusionMode data type will have three constructors – one for each valid mode of operation – snippet mode, range mode, and entire file mode.

data InclusionMode
  = SnippetMode Text
  | RangeMode Range
  | EntireFileMode
  deriving (Show, Eq)

We want these to be comparable and printable using show, and we see from the type error that it requires the Range type to provide the corresponding instances.

data Range = Range { rangeStart :: Int
                   , rangeEnd   :: Int
                   }
                   deriving (Show, Eq)

Following the Type Errors

Jumping to the next error, we see that there is no longer a range in scope. This is one of the fields we removed from the InclusionSpec. Let’s see how the filterLineRange function is actually used.

allSteps :: Inclusion Text
allSteps =
  readIncluded
    >>= splitLines
    >>= filterLineRange
    >>= onlySnippet
    >>= dedentLines
    >>= joinLines

Here we see that the definition for allSteps reads the included file, splits the lines, filters by the line range, extracts the named snippet, dedents the lines, and finally joins the lines back together. Now that we’re making the modes mutually exclusive, we can combine filterLineRange and onlySnippet into a single step. We’ll rename it includeByMode.

allSteps :: Inclusion Text
allSteps =
  readIncluded
    >>= splitLines
    >>= includeByMode
    >>= dedentLines
    >>= joinLines

Instead of asking for an optional range, we’ll ask for and pattern match on the mode. In case we’re in RangeMode, we can do the same thing as before. If we’re in EntireFileMode, we’ll just return the lines unchanged. The pattern match for Just in onlySnippet needs to be merged in.

includeByMode :: Lines -> Inclusion Lines
includeByMode ls = asks mode >>= \case
  RangeMode range -> return
    (take (rangeEnd range - startIndex) (drop startIndex ls))
    where startIndex = pred (rangeStart range)
  EntireFileMode -> return ls
  SnippetMode name ->
    return $ drop 1 $ takeWhile (not . isSnippetEnd name) $ dropWhile
      (not . isSnippetStart name)
      ls

Nice, we now have all three modes covered. Jumping to the next error, we see that the parsing that constructs our InclusionSpec needs some changes as well. Instead of the optional range and snippet, we now want a single mode. We rename getRange to parseRangeMode, and wrap the parsed range with the RangeMode constructor. In the same way, snippet becomes snippetMode, wrapping the parsed snippet name with the SnippetMode constructor.

...
 where
  lookupInt name = HM.lookup name attrs >>= readMaybe
  snippetMode  = SnippetMode . Text.pack <$> HM.lookup "snippet" attrs
  dedent   = lookupInt "dedent"
  parseRangeMode = case (lookupInt "startLine", lookupInt "endLine") of
    (Just start, Just end) -> maybe (throwError (InvalidRange start end))
                                    (return . Just . RangeMode)
                                    (mkRange start end)
    (Nothing, Just _ ) -> throwError (IncompleteRange Start)
    (Just _ , Nothing) -> throwError (IncompleteRange End)
    (Nothing, Nothing) -> return Nothing

Selecting a Single Inclusion Mode

We select a single mode by transforming a list of Maybe InclusionMode values to a list of modes. The catMaybe function filters out the Nothing values, and extracts the Just values. Pattern matching on that resulting list, if we have no modes specified, we default to EntireFileMode. In case we have a single mode, we use that. In case we have more than one mode specified, we throw an error, describing the conflicting modes.

...
  Just include -> do
    rangeMode <- parseRangeMode
    mode <-
      case catMaybes [rangeMode, snippetMode] of
        []  -> return EntireFileMode
        [m] -> return m
        ms  -> throwError (ConflictingModes ms)
    return (Just InclusionSpec {..})
...

We need to import catMaybes from the Data.Maybe module.

import Data.Maybe (catMaybes)

Jumping to the next error, we are reminded that we have yet to define the ConflictingModes error constructor. It will hold the list of conflicting modes.

data InclusionError
  = InvalidRange Int
                 Int
  | IncompleteRange MissingRangePart
  | ConflictingModes [InclusionMode]
  deriving (Show, Eq)

And in the last error in this exercise, the compiler kindly informs us that the formatError function is partial. We add the missing case.

...
 where
  formatError = \case
    InvalidRange start end ->
      "Invalid range: " ++ show start ++ " to " ++ show end
    IncompleteRange Start -> "Incomplete range: \"startLine\" is missing"
    IncompleteRange End   -> "Incomplete range: \"endLine\" is missing"
    ConflictingModes ms -> "Conflicting inclusion modes: " ++ show ms

Summary

Next error? Nope, we’re done. I hope this highlights how a suitable data structure can guide you towards a much more solid program, using very simple Haskell code.

There are some tests that need changes, that I haven’t shown you, and of course the feature request itself that led to this modification, which I’ve left out. If you’re interested, drop a comment on YouTube or on Patreon, and I might do a follow-up video in the same style.

Thanks for watching!