Pure Standard ML git plumbing — the byte-level
object, packfile, index and ref formats that every git implementation
shares, with no networking. Loose objects (blob/tree/commit/tag),
SHA-1 object ids, zlib (de)compression, packfile decoding with full
ofs-delta and ref-delta reconstruction, the dircache (.git/index), and
ref parsing (loose refs + packed-refs) — all as pure, deterministic codecs
over byte strings.
zlib and SHA-1 come from the vendored
sml-deflate (which bundles
sml-codec and a string-based
Zlib facade). No FFI, no threads, no clock, no network — and deterministic,
byte-identical under both MLton and
Poly/ML.
A network fetch/push transport (the git smart-HTTP / ssh protocol) is deliberately out of scope: that would be a separate, quarantined IO tool on top of this pure format layer. Everything here works against on-disk git objects with zero IO of its own.
- 114 assertions, green on MLton and Poly/ML, with byte-identical output.
- Validated against real `git` fixtures committed under `test/fixtures/`, generated by the system `git` CLI (`generate.sh`) with pinned identities/dates so every object id is reproducible:
  - `hashObject` reproduces git's exact 40-hex oid for every object git wrote: blob, tree, commit, annotated tag (cross-checked with `git hash-object` / `git rev-parse`).
  - loose objects inflate → parse → re-hash back to the same oid; `encodeLoose` round-trips.
  - two real packfiles of the same repo, one with OBJ_OFS_DELTA (from `git repack`) and one with OBJ_REF_DELTA (from `git pack-objects --no-delta-base-offset`), decode to all 10 objects, and every reconstructed object (including a delta-encoded blob and a delta-encoded commit) re-hashes to git's oid.
  - the real `.git/index` (v2) parses to the same 3 entries `git ls-files` reports; `packed-refs` and loose refs parse to the expected values.
- Basis-library only; vendors `sml-deflate` (Layout B), so the repo builds standalone.
With smlpkg:
smlpkg add github.com/sjqtentacles/sml-git
smlpkg sync
The library's MLB pulls in the vendored sml-deflate (zlib) + sml-codec
(SHA-1, base16), so structure Git and those structures all come into scope.
(* bytes are raw strings: one char per byte, 0-255 *)
(* hash an object exactly as git does: SHA-1 over "<type> <len>\0<payload>" *)
val oid = Git.hashObject (Git.Blob "hello\n")
(* "ce013625030ba8dba906f756967f9e9ca394464a" — same as `git hash-object` *)
(* decode a loose object straight off disk (zlib + parse) *)
val obj = Git.decodeLoose loosebytes
val Git.Commit {tree, parents, author, message, ...} = obj
(* parse a tree payload into (mode, name, oid) entries *)
val entries = Git.parseTree treepayload
(* decode a packfile (.pack + .idx), applying ofs/ref deltas *)
val pack = Git.Pack.parse {pack = packbytes, idx = idxbytes}
val all = Git.Pack.objects pack (* (oid, obj) list, deltas applied *)
val one = Git.Pack.lookup pack oid (* obj option *)
(* parse the dircache and refs *)
val staged = Git.Index.parse indexbytes (* {path, id, mode, size} list *)
val target = Git.Ref.parse "ref: refs/heads/main\n" (* Symbolic "refs/heads/main" *)
val refs = Git.Ref.parsePacked packedrefsbytes (* (name, oid) list *)

`make example` runs `examples/demo.sml` over the real
fixture objects — decoding a loose commit, walking its tree, and reconstructing
a delta-encoded blob out of a packfile, verifying every oid against git's:
sml-git demo
============
loose object : 94c43da0c98a3a96c58f00d6e5a06aa70c0dd410
type : commit
tree : 86fa20b61dbf28683a0ae91e87fcdab4e0854186
parent : 0f0b31f0d114015c2b3de56e88c2781c8564b2de
author : Fixture Author <author@example.com> 1700000000 +0000
message : second commit: edit big.txt and add docs/note.txt
hashObject : 94c43da0c98a3a96c58f00d6e5a06aa70c0dd410
oid matches : true
tree 86fa20b61dbf28683a0ae91e87fcdab4e0854186 :
100644 ee16ef7b005794b50717c165021d16044454c00b big.txt
40000 ebe2cbff43af78278245c9487c67456f2f3eb4e0 docs
100644 ce013625030ba8dba906f756967f9e9ca394464a hello.txt
packfile : 10 objects (version 2)
delta object : f0d872f6993b3b84102595978057dc29962f45a6
reconstructed : 32499 bytes
oid matches : true
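
The `hashObject` lines in the demo depend on git's object framing, `<type> <len>\0<payload>`, with the SHA-1 taken over the framed bytes. As a minimal sketch of that framing only (the helper name `frame` is hypothetical and not part of the library, and the hashing step is left out), using nothing beyond the Basis library:

```sml
(* Hypothetical helper, not the library's API: build the byte string
   git hashes for an object, "<type> <len>\0<payload>". *)
fun frame (typ : string, payload : string) : string =
  typ ^ " " ^ Int.toString (String.size payload) ^ "\000" ^ payload

(* A 6-byte blob payload gets the header "blob 6" plus a NUL byte. *)
val framed = frame ("blob", "hello\n")
(* framed = "blob 6\000hello\n", 13 bytes in total *)
```

The length is the payload's byte count written in decimal, which is why raw byte strings (one char per byte) are convenient here: `String.size` is already the number git expects.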
type treeEntry = { mode : string, name : string, id : string } (* id = 40-hex *)
type commit = { tree : string, parents : string list, author : string
, committer : string, message : string }
type tag = { object : string, typ : string, tag : string
, tagger : string, message : string }
datatype obj = Blob of string
| Tree of treeEntry list
| Commit of commit
| Tag of tag
val objectType : obj -> string (* "blob"|"tree"|"commit"|"tag" *)
val serialize : obj -> string (* "<type> <len>\0<payload>" bytes *)
val payload : obj -> string
val hashObject : obj -> string (* 40-hex SHA-1 oid, = git's *)
val encodeLoose : obj -> string (* zlib-compressed loose object *)
val decodeLoose : string -> obj (* inflate + parse *)
val parseObject : { typ : string, payload : string } -> obj
val parseTree : string -> treeEntry list
val parseCommit : string -> commit
val parseTag : string -> tag
structure Pack : sig
type pack
val parse : { pack : string, idx : string } -> pack
val objects : pack -> (string * obj) list (* deltas applied *)
val lookup : pack -> string -> obj option
val count : pack -> int
val version : pack -> int
end
structure Index : sig
type entry = { path : string, id : string, mode : int, size : int }
val parse : string -> entry list (* dircache / .git/index v2 *)
val version : string -> int
end
structure Ref : sig
datatype ref = Direct of string | Symbolic of string
val parse : string -> ref (* a loose ref file's contents *)
val parsePacked : string -> (string * string) list (* packed-refs *)
end

- Bytes as `string`. Object payloads, blob contents and SHA-1 outputs are raw byte strings (one char per byte, 0–255), matching the rest of the sjqtentacles crypto/codec family.
- Oids are 40-hex. Everything you see is a lowercase 40-char hex oid, exactly as `git rev-parse` prints it; the 20 raw bytes git stores inside tree entries and ref-deltas are converted to/from hex at the library boundary.
- `Pack.parse` takes both `.pack` and `.idx`. The idx supplies each object's byte offset, which bounds the per-object zlib stream exactly (the streams are stored back-to-back with no length markers, and the bundled inflate verifies a trailing Adler-32). This is also how real git pairs the two files.
- Deltas. Both ofs-delta (relative base offset) and ref-delta (base oid) are reconstructed by applying the copy/insert instruction stream against the fully-resolved base object.
- 32-bit-safe. All binary reads stay below 2³¹ so MLton's default 32-bit `Int` never overflows; packs larger than 2 GiB (64-bit offsets) are rejected rather than silently mis-decoded.
- Malformed input raises Git.
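
To make the Deltas note concrete, here is a rough sketch of applying a git delta (two size varints followed by copy/insert instructions) to an already-resolved base. It follows the layout described in git's pack-format documentation; `applyDelta`, `varint`, `fields`, `byteAt` and `BadDelta` are hypothetical names chosen for this example and are not the library's internals. Unlike the library, the sketch makes no attempt to stay 32-bit-safe: `Word.toInt` may raise `Overflow` on very large offsets.

```sml
(* Sketch only: all names here are hypothetical, not the library's API. *)
exception BadDelta

fun byteAt (s : string, i : int) : word =
  if i < String.size s then Word.fromInt (Char.ord (String.sub (s, i)))
  else raise BadDelta

(* Little-endian base-128 varint; returns (value, next index). *)
fun varint (s : string, start : int) : int * int =
  let
    fun go (i : int, shift : word, acc : word) =
      let
        val b = byteAt (s, i)
        val acc' = Word.orb (acc, Word.<< (Word.andb (b, 0wx7F), shift))
      in
        if Word.andb (b, 0wx80) <> 0w0 then go (i + 1, shift + 0w7, acc')
        else (Word.toInt acc', i + 1)
      end
  in
    go (start, 0w0, 0w0)
  end

(* Read the optional bytes selected by n opcode bits starting at firstBit
   and assemble them little-endian; returns (value, next index). *)
fun fields (s : string, cmd : word, start : int, firstBit : int, n : int) =
  let
    fun go (k : int, i : int, acc : word) =
      if k = n then (acc, i)
      else if Word.andb (cmd, Word.<< (0w1, Word.fromInt (firstBit + k))) <> 0w0
      then go (k + 1, i + 1,
               Word.orb (acc, Word.<< (byteAt (s, i), Word.fromInt (8 * k))))
      else go (k + 1, i, acc)
  in
    go (0, start, 0w0)
  end

fun applyDelta (base : string, delta : string) : string =
  let
    val (srcSize, i0) = varint (delta, 0)
    val (dstSize, i1) = varint (delta, i0)
    val () = if srcSize <> String.size base then raise BadDelta else ()
    fun loop (i, out) =
      if i >= String.size delta then String.concat (List.rev out)
      else
        let val cmd = byteAt (delta, i) in
          if Word.andb (cmd, 0wx80) <> 0w0 then
            (* copy: up to 4 offset bytes, then up to 3 size bytes *)
            let
              val (off, j) = fields (delta, cmd, i + 1, 0, 4)
              val (len, k) = fields (delta, cmd, j, 4, 3)
              val len = if len = 0w0 then 0x10000 else Word.toInt len
              val off = Word.toInt off
            in
              if off + len > String.size base then raise BadDelta
              else loop (k, String.substring (base, off, len) :: out)
            end
          else if cmd = 0w0 then raise BadDelta (* reserved opcode *)
          else
            (* insert: the next cmd bytes are literal data *)
            let val n = Word.toInt cmd in
              if i + 1 + n > String.size delta then raise BadDelta
              else loop (i + 1 + n, String.substring (delta, i + 1, n) :: out)
            end
        end
    val result = loop (i1, [])
  in
    if String.size result = dstSize then result else raise BadDelta
  end

(* sizes 11 -> 15; copy 5 bytes at 0, insert " big", copy 6 bytes at 5 *)
val patched =
  applyDelta ("hello world", "\011\015\144\005\004 big\145\005\006")
(* patched = "hello big world" *)
```

For an ofs-delta the base is found by its relative offset in the pack, and for a ref-delta by its oid; in both cases the base must itself be fully reconstructed before a function like this can run.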
make test # MLton
make test-poly # Poly/ML (via tools/polybuild)
make all-tests # both
make example # build + run examples/demo.sml over the fixtures
make clean
Regenerate the fixtures (requires a system git):
sh test/fixtures/generate.sh
MIT — see LICENSE.