Mar 2026 · 11 min read

Rewriting a Python web log parser in Go

I rewrote a Python web log parser as a Go HTTP service using Factory mission mode. On 1.89 million lines of real NASA access logs, the Go version parses 3.3x faster than the Python baseline. The whole thing -- service, tests, benchmark harness -- was built in one orchestrated mission.

benchmarks · go · rewrite study · systems · factory

Rewrite study · Published

Project: parser-go
Baseline: Python parser @ 904f838
Rewrite: parsergo (Go) @ 8e2ef20
Language: Go

Head-to-head on 1.89M lines of real NASA Kennedy Space Center access logs (July 1995). Go parses 3.3x faster than Python.

Speedup: 3.3x faster
Go throughput: ~485K lines/sec
Memory (RSS): ~723 MiB (Go) vs ~718 MiB (Py)
Dataset: 1.89M lines (NASA KSC)
Lines of Go: ~12,000

  • Go parses 1.89M real NASA access logs in 3.9s vs 13.0s for Python -- a 3.3x speedup.
  • Both implementations use the same regex and produce identical workload counts (1,887,880 matched, 3,834 malformed).
  • Peak RSS is comparable (~723 MiB Go vs ~718 MiB Python). The speedup comes from CPU, not memory.
  • The entire project was built using Factory mission mode in one orchestrated mission spanning multiple worker sessions.

Go to: The problem | Architecture | The parser | The service | The benchmark harness | Results | How mission mode built this | Limitations

I had a Python script that parsed Apache/Nginx combined log format access logs. It worked, but it was slow, it was a script, and I wanted an HTTP service I could throw log files at and get back structured summaries. So I rewrote the whole thing in Go.

The interesting part is not just the rewrite itself. It is how I built it: using Factory mission mode, which planned the entire project, wrote the code across multiple worker sessions, ran its own scrutiny reviews, and validated every milestone before moving on. On 1.89 million lines of real NASA access logs, the Go version is 3.3x faster than the Python baseline. The final repo has ~12,000 lines of Go across 33 files, a full HTTP service, a benchmark harness that enforces correctness before allowing performance claims, and 8 passing test suites. That all came out of one mission.

The problem

Combined log format looks like this:

198.51.100.24 - - [27/Mar/2026:22:35:03 -0700] "GET /api/users HTTP/1.1" 200 1234
198.51.100.55 - - [27/Mar/2026:22:35:04 -0700] "POST /api/orders HTTP/1.1" 201 89
198.51.100.24 - - [27/Mar/2026:22:35:05 -0700] "GET /healthz HTTP/1.1" 200 2

The job is: parse every line, count requests by method+path, filter out noise like health checks, and produce a ranked summary. The Python version did this as a CLI tool. I wanted something I could curl a log file at and get JSON back, with an HTML report I could pull up in a browser.

Architecture

The Go version ended up with a pretty clean split:

cmd/parsergo/          CLI entrypoint, server wiring, readiness probing
internal/analysis/     The parser itself (regex, line-by-line streaming)
internal/summary/      Canonical summary computation (deterministic ranking)
internal/api/          HTTP handlers (analysis submission, job polling, reports)
internal/job/          In-memory job store with state machine
internal/server/       Server type (minimal, used by entrypoint)
internal/bench/        Benchmark harness (parity checks, fairness controls)

There is no database. Jobs live in memory with a configurable retention period (PARSERGO_RETENTION, defaults to 24h). The whole service is one binary with zero dependencies outside the Go standard library.

The design rule I cared about most: the canonical summary is the single source of truth. The same summary struct flows into API responses, HTML reports, and benchmark parity checks. There is no place where the report can quietly disagree with the API.

The parser

The parser is straightforward. It uses a compiled regex to match combined log format:

var combinedLogRegex = regexp.MustCompile(
    `^(?P<remote>\S+)\s+` +
        `(?P<ident>\S+)\s+` +
        `(?P<auth>\S+)\s+` +
        `\[(?P<timestamp>[^\]]+)\]\s+` +
        `"(?P<method>\S+)\s+(?P<path>\S+)\s+(?P<protocol>\S+)"\s+` +
        `(?P<status>\d+)\s+` +
        `(?P<size>\d+|-)`,
)

Each line goes through parseCombinedLog, which extracts method, path, status, size, and timestamp. Lines that don't match the regex are counted as malformed. Health check paths (/healthz, /readyz, /ping, etc.) are filtered out and tracked separately.
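
A condensed sketch of that per-line logic, using the same regex shown above; the record type, helper name, and the (nil, nil) convention for filtered lines are illustrative, not the repo's actual API:

```go
package main

import (
	"fmt"
	"regexp"
)

var combinedLogRegex = regexp.MustCompile(
	`^(?P<remote>\S+)\s+(?P<ident>\S+)\s+(?P<auth>\S+)\s+` +
		`\[(?P<timestamp>[^\]]+)\]\s+` +
		`"(?P<method>\S+)\s+(?P<path>\S+)\s+(?P<protocol>\S+)"\s+` +
		`(?P<status>\d+)\s+(?P<size>\d+|-)`,
)

// healthPaths holds the filtered endpoints mentioned in the text.
var healthPaths = map[string]bool{"/healthz": true, "/readyz": true, "/ping": true}

type record struct{ Method, Path, Status string }

// parseCombinedLine returns an error for malformed lines, (nil, nil)
// for filtered health-check lines, and a record otherwise.
func parseCombinedLine(line string) (*record, error) {
	m := combinedLogRegex.FindStringSubmatch(line)
	if m == nil {
		return nil, fmt.Errorf("malformed line")
	}
	idx := combinedLogRegex.SubexpIndex
	path := m[idx("path")]
	if healthPaths[path] {
		return nil, nil // filtered, not malformed
	}
	return &record{Method: m[idx("method")], Path: path, Status: m[idx("status")]}, nil
}

func main() {
	rec, err := parseCombinedLine(`198.51.100.24 - - [27/Mar/2026:22:35:03 -0700] "GET /api/users HTTP/1.1" 200 1234`)
	fmt.Println(rec, err)
}
```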

The engine wraps a bufio.Scanner around a counting reader so it can report exact input bytes without the off-by-one errors you get from counting newlines:

counter := &countingReader{reader: r}
scanner := bufio.NewScanner(counter)

for scanner.Scan() {
    line := scanner.Text()
    result.TotalLines++

    rec, err := e.parseLine(line)
    if err != nil {
        result.Malformed++
        continue
    }
    if rec == nil {
        result.Filtered++
        continue
    }
    result.Matched++
    result.Records = append(result.Records, *rec)
}
result.InputBytes = counter.count

The workload accounting -- TotalLines, Matched, Filtered, Malformed -- is tracked at the engine level so the benchmark harness can later verify that both the Python baseline and the Go rewrite did the same amount of work on the same input.

The summary

After parsing, the summary computation aggregates records by method+path and sorts them:

sort.SliceStable(sum.RankedRequests, func(i, j int) bool {
    if sum.RankedRequests[i].Count != sum.RankedRequests[j].Count {
        return sum.RankedRequests[i].Count > sum.RankedRequests[j].Count
    }
    if sum.RankedRequests[i].Path != sum.RankedRequests[j].Path {
        return sum.RankedRequests[i].Path < sum.RankedRequests[j].Path
    }
    return sum.RankedRequests[i].Method < sum.RankedRequests[j].Method
})

Primary sort by count descending, tie-break by path then method ascending. This is deterministic: identical input always produces identical output. That matters a lot for benchmarking, because if the ranking is nondeterministic, you can't diff the Go output against the Python output and call it a correctness check.

Requests-per-second is derived from the timestamp span in the data (first record to last record), not from wall-clock time. That was a bug in the first version -- the rate changed depending on how fast your machine was, which made benchmark parity impossible.

The service

The HTTP service is a single http.ServeMux with no framework:

mux := http.NewServeMux()
analysisHandler.RegisterRoutes(mux)
reportHandler.RegisterRoutes(mux)

POST /v1/analyses accepts a multipart upload with format, profile, and file fields. The handler validates the content type, checks the format is supported (only combined for now), rejects unsafe filenames, and enforces a size limit. If the queue is full, it returns 429 with a Retry-After header instead of silently dropping the request.

Jobs go through a state machine: queued -> running -> succeeded | failed. There is also expired for jobs past the retention window. The API supports idempotent submissions -- if you send the same file with the same Idempotency-Key header, you get back the original job instead of creating a duplicate.
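
The idempotency behavior amounts to a keyed lookup in front of job creation. A sketch with assumed type and method names (the real store also tracks state and retention):

```go
package main

import (
	"fmt"
	"sync"
)

// idempotentStore sketches the dedup behavior described above: the same
// Idempotency-Key returns the original job instead of creating a new one.
type idempotentStore struct {
	mu     sync.Mutex
	byKey  map[string]string
	nextID int
}

func newIdempotentStore() *idempotentStore {
	return &idempotentStore{byKey: make(map[string]string)}
}

// Submit returns (jobID, created); created is false on a replayed key.
func (s *idempotentStore) Submit(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if key != "" {
		if id, ok := s.byKey[key]; ok {
			return id, false // replay: hand back the original job
		}
	}
	s.nextID++
	id := fmt.Sprintf("job-%d", s.nextID)
	if key != "" {
		s.byKey[key] = id
	}
	return id, true
}

func main() {
	s := newIdempotentStore()
	a, _ := s.Submit("key-1")
	b, created := s.Submit("key-1")
	fmt.Println(a == b, created) // replayed key: same job, not created again
}
```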

The report surface at /reports/{id} renders self-contained HTML with inline SVG charts. No CDN, no external fonts, no JavaScript fetches. The page works offline if you save it.

One detail I'm happy with: the readiness probe. /readyz returns 503 during startup, and only flips to 200 after the server has proven it can actually serve traffic (by probing its own /healthz endpoint). Submissions during the startup window get 503, not a silent 202 that creates a job that will never run.

The benchmark harness

This is where it gets interesting. The benchmark harness in cmd/bench does not just time two programs. It enforces that they produce the same output before it lets you compare their speed.

The flow for one scenario:

  1. Load the scenario definition (corpus path, normalization rules, fairness controls)
  2. Resolve the Python baseline and Go rewrite binaries
  3. For each round, alternate execution order (baseline-first or rewrite-first, determined by hashing the scenario ID)
  4. Before each timed run, drop the file cache (sync + write to /proc/sys/vm/drop_caches)
  5. Pin both processes to CPU 0 via taskset -c 0
  6. Run warmup iterations (discarded), then measured iterations (timed with getrusage)
  7. Collect wall time, CPU time, and max RSS from each iteration
  8. After both sides finish: normalize their outputs and diff them field by field
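
Step 7 of that flow can be sketched as one measured iteration: run the child, time it, and pull CPU time and max RSS from the child's rusage (in the harness the argv would start with taskset -c 0 to pin the process). Unix-only, and the names here are illustrative:

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// sample holds the three measurements collected per iteration.
type sample struct {
	Wall, CPU time.Duration
	MaxRSSKiB int64 // KiB on Linux (bytes on macOS)
}

// timeOnce runs one command to completion and reports wall time plus
// the child's own CPU time and peak RSS via getrusage.
func timeOnce(argv ...string) (sample, error) {
	cmd := exec.Command(argv[0], argv[1:]...)
	start := time.Now()
	err := cmd.Run()
	wall := time.Since(start)
	if err != nil {
		return sample{}, err
	}
	ru := cmd.ProcessState.SysUsage().(*syscall.Rusage)
	cpu := cmd.ProcessState.UserTime() + cmd.ProcessState.SystemTime()
	return sample{Wall: wall, CPU: cpu, MaxRSSKiB: ru.Maxrss}, nil
}

func main() {
	s, err := timeOnce("sh", "-c", "exit 0")
	fmt.Println(s.Wall > 0, s.MaxRSSKiB > 0, err)
}
```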

The parity check compares the normalized outputs field by field: the workload counts (total, matched, filtered, malformed) and the ranked summary.

If any field drifts between baseline and rewrite, the run fails parity and the harness sets performance_claims_allowed: false. You literally cannot get a benchmark number out of this harness without first proving correctness.

The fairness controls are also validated, not just recorded. The harness checks that warmup and measured iteration counts match what was declared, that cache drops succeeded, that taskset was actually applied. If any control cannot be proven, claimable goes to false.

report.Claimable = report.Symmetric
for _, evidence := range controlEvidence {
    if !evidence.Claimable {
        report.Claimable = false
        break
    }
}

Results

I benchmarked both the Go and Python parsers on the same real data: the NASA Kennedy Space Center HTTP access logs from July 1995. This is a publicly available dataset of 1,891,715 requests to a production web server, freely redistributable from the Internet Traffic Archive.

Methodology

Both implementations use the same named-group regex pattern, the same health-check filter list, and read the same file. Both extract all fields (timestamp, method, path, status, size) and build a record list. The Python benchmark uses datetime.strptime for timestamp parsing; Go uses time.Parse. Both were run 10 times on the same machine with no other significant load.

Both produce identical workload counts: 1,887,880 matched, 1 filtered, 3,834 malformed. The 3,834 malformed lines are requests from 1995 with unencoded spaces in the URL (e.g., GET /htbin/wais.pl?orbit sts71 HTTP/1.0), which both parsers reject since the regex requires \S+ for each request field.

Head-to-head

                 Go 1.26     Python 3.11   Ratio
Mean wall time   3.91s       13.01s        3.3x
Std dev          0.07s       0.10s
Lines/sec        ~485,000    ~145,000      3.3x
MB/sec           52.5        15.0          3.5x
Peak RSS         723 MiB     718 MiB       ~1x

The Go version parses the full dataset in 3.9 seconds versus 13.0 seconds for Python -- a 3.3x speedup.

The gap is larger than the ~1.9x I saw in earlier tests on tiny synthetic inputs (5 and 18 lines), where process startup overhead dominated the measurement. At scale, the real bottleneck is timestamp parsing and per-line object allocation, where Go's time.Parse and value-type structs have a significant CPU advantage over Python's datetime.strptime and heap-allocated tuples. Peak RSS is roughly the same for both -- the win is wall time, not memory.

Reproducibility

The repo includes Go-native benchmarks (go test -bench) that anyone can run. A 10,000-line NASA sample is committed at benchmark/corpora/nasa/nasa_10k.log. For the full dataset:

curl -o /tmp/NASA_access_log_Jul95.gz ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
gunzip /tmp/NASA_access_log_Jul95.gz
mv /tmp/NASA_access_log_Jul95 /tmp/nasa_jul95
go test -bench=BenchmarkParse_NASAFull -benchmem ./internal/analysis/

Environment

All numbers from one machine: Intel i5-12500T (12 logical cores), 64 GB RAM, Debian 12, Linux 6.17, Go 1.26, Python 3.11.2.

How mission mode built this

This is the part I actually want to talk about.

I did not write this code by hand over a few weekends. I described what I wanted to Factory and ran it as a mission. The mission system broke the project into milestones:

  1. Foundation -- Go module, CLI entrypoint, project scaffolding
  2. Service slice -- Analysis engine, HTTP API, report surface
  3. Service hardening -- Idempotency, backpressure, retention, terminal failure handling
  4. Benchmark homelab -- Benchmark harness, parity gates, fairness controls, real traffic scenarios
  5. Publication -- Release candidate generation (internal staging), documentation

Each milestone went through the same cycle: worker sessions wrote the code, a scrutiny validator reviewed every completed feature for bugs, and a user-testing validator ran the actual service and verified behavior against a contract. If scrutiny found issues, the orchestrator created fix features and re-ran validation.

Scrutiny caught a number of bugs during the mission, and the system fixed them on its own.

These are not exotic bugs. They are exactly the kind of thing that survives code review because each one looks fine in isolation. The mission system caught them because it had specific behavioral contracts to test against and it ran the checks automatically after every implementation feature.

The total output: 20 Go source files (plus 13 test files), 8 test suites that all pass, a benchmark harness with fairness enforcement, and a working HTTP service. The mission ran 38 feature sessions across 5 milestones. Each milestone's scrutiny pass spawned review subagents for every completed feature, and the user-testing pass spawned flow validators that actually started the service, submitted log files, polled for results, and checked the response.

Testing

The test suite is not just happy-path unit tests: it exercises the API handlers, the service entrypoint, and the benchmark evidence checks.

The tests run in about 1 second total. Most API tests use httptest.NewRecorder so they stay in-process, but the entrypoint and evidence tests do bind real local listeners via net.Listen and httptest.NewServer.

Limitations

I do not want to oversell what is still a small-scope project.

  1. One host. All numbers come from one i5-12500T. Different CPUs, different operating systems, different filesystems would give different numbers.
  2. Combined log format only. The parser does not handle JSON logs, Caddy format, or anything else. Adding a new format means implementing a new parser function.
  3. No persistence. Jobs live in memory. Restart the service and everything is gone. This is fine for a single-user tool, not for production.
  4. The Python baseline is vanilla CPython. A Python rewrite using compiled parsing, re2 bindings, or a C extension would close the gap. The 3.3x number reflects CPython 3.11's regex and datetime.strptime performance, not an optimized Python implementation.

Source

The full source is at github.com/sagaragas/parser-go. It is Apache-2.0 licensed. Clone it, run go test ./..., run go test -bench=. ./internal/analysis/ to reproduce the benchmarks, or run go run ./cmd/parsergo serve and throw a log file at it.