Encode and Decode Strings

“Think about in-place manipulation — how do you rewrite an array without extra space?”

🌱 NoviceMedium

The Problem

Design an algorithm to encode a list of strings to a single string. The encoded string is then sent over the network and is decoded back to the original list of strings. Machine A has to send a message to machine B, so implement the two methods encode and decode. Machine A calls encode before sending and Machine B calls decode after receiving.

Example 1:

Input:   ["hello", "world"]
Encoded: "5#hello5#world"
Output:  ["hello", "world"]

Example 2:

Input:   ["a#b", "cd"]
Encoded: "3#a#b2#cd"
Output:  ["a#b", "cd"]

Example 3:

Input:   [""]
Encoded: "0#"
Output:  [""]

The Shipping Belt Analogy

Imagine a busy warehouse that ships packages across the country on a long conveyor belt. Before anything goes on the belt, a packer stamps a shipping label on each package. The label states exactly how many items are inside — say, 5 items — followed by a dividing bar |, followed by the actual contents of the package. When the packages arrive, a warehouse worker at the other end reads each label, counts out exactly that many items from the belt, sets them aside as one package, then moves on to the next label.

The key insight is how the worker reads the belt. They never look for a special marker character inside the package contents. They read the label's count, take exactly that many items, and advance. This means even if a package happens to contain something that looks like a label — say, a box labeled 3|abc sitting inside another package — the worker doesn't care. They already know how many items to take, and they stop counting the moment they hit the quota.

In this problem, we're the packer and the worker. Our strings are the packages. Encoding means stamping a length label on every string and loading everything onto one long belt (one big encoded string). Decoding means reading each label in turn and peeling off exactly that many characters to recover each original string. The separator between the label and the contents is # — our dividing bar — and what makes the scheme bulletproof is that we never scan for # inside the content. We always read by count.

Understanding the Analogy

The Setup

We have a pile of packages arriving at the loading dock. Each package can contain absolutely anything — any combination of symbols, numbers, or even characters that look like our label format. Our job is to load all of them onto a single conveyor belt in one continuous stream, then at the other end recover every original package without mixing up contents or accidentally splitting one package in two.

The constraint that makes this hard: we cannot agree on a special "end of package" character, because any character we pick might appear inside a package's contents. If we try to use | as a separator and a package contains |, the worker at the other end won't know whether that | is the separator or part of the contents.

The Length Label

Instead of separating packages with a special character, we stamp a length label on the front of each package before it goes on the belt. The label says: "the next N characters on the belt are mine." The label is formatted as N# — the count as a string of digits, then the dividing #, then the N characters of content.

Why # as the divider between label and content? Because our label is always just digits, we know to read digits first, then look for the # that terminates the label. Once we see #, we stop reading the label and start counting out N characters of content. Crucially, we never search for another # inside those N characters — we count by the label, period.

Edge cases fall out naturally. An empty string gets the label 0# — zero items, then the divider, then nothing. The worker reads 0, skips the #, takes zero characters, and correctly recovers an empty string. A completely empty input list produces no labels at all, and decoding an empty belt returns an empty list.

Why This Approach

Reading by count is robust because it makes the content opaque. The worker doesn't need to understand what's inside a package — they just count. If you tried to encode by joining with a delimiter like ,, the moment any string contained a comma, the decoder would miscount. You'd be forced to escape every occurrence of the delimiter — which creates a new problem, since the escape character might appear too. Length-prefix encoding sidesteps the entire escaping problem: the content is never parsed, only counted.

The trade-off is that the encoded form is slightly longer (each string grows by the number of digits in its length plus one #). For most practical purposes that's fine — and the correctness guarantee you get in return is absolute.

How I Think Through This

I think about this in two passes. When I encode, I do the packer's job: for each string, I stamp its length, add #, then place the string right after it. I am not trying to protect special characters inside the string. I am making that unnecessary by telling the decoder exactly how many characters belong to this package.

When I decode, I keep one cursor at the start of the next label. I read digits until #, turn that into a length, then take exactly that many characters as the payload. After that, I jump the cursor to the next label and repeat. The thing I keep reminding myself is: once I know the length, the contents are opaque. I do not inspect them for structure. I just count.

Take ["a#b", "cd"].

Llabel#boundaryVpayload

1 / 1

input["a#b", "cd"]encoded belt3#a#b2#cd

READ

One belt can hold multiple stamped packages back-to-back. Each package is just length label, separator, then payload.

Building the Algorithm

Each step introduces one concept from the Shipping Belt analogy, then code blocks to try it.

For both steps, use the same running example:

input strings: ["a#b", "cd"]
encoded belt: "3#a#b2#cd"

Step 1: Stamp Each Package with Its Length Label (Encode)

The packer's job is simple: for every package in the pile, read off how many items it contains, write {count}#, then write the contents. Stack those labels back-to-back into one long belt string.

The encode function visits every string once and appends one chunk. That's it.

Consider ["a#b", "cd"]. At this step, the only thing that matters is how each original string is stamped onto the belt. The trace shows each package being turned into one {length}#payload chunk, building the final encoded belt "3#a#b2#cd".

Llabel#boundaryVpayloadDdone

1 / 3

current stringa#b

The belt starts empty. The encoder is about to stamp the package 'a#b'.

Loading editor...

Step 2: Read Each Label and Count Out the Package (Decode)

The decoder keeps a cursor — call it pos — that always points to the first digit of the next label. From there it scans right until it finds #, reads the digits before it as the length len, then takes exactly len characters as the payload.

Once the length is known, those len characters are treated as cargo, not structure. That is why "a#b" does not cause confusion: after reading 3#, the decoder already knows the next three characters belong together, so the # inside is just another item on the belt.

After consuming the payload, pos advances to hashPos + 1 + len — the start of the next label. That single invariant keeps the entire parse aligned from beginning to end.

Now use the encoded belt from Step 1: "3#a#b2#cd". This trace shows how the decoder reads that same belt back into ["a#b", "cd"] by staying aligned from one label boundary to the next.

PpointerLlabel#boundaryVpayloadDdone

1 / 7

pos

pos0

SCAN

The decoder starts with pos at index 0, which must be the beginning of the next label.

Loading editor...

Tracing through an Example

Use a fresh example so you can watch the full encode/decode cycle without reusing the same belt from the build steps.

Input: ["", "code", "x#y"]

PpointerLlabel#boundaryVpayloadDdone

1 / 11

input["", "code", "x#y"]

Start with three packages to stamp onto the shipping belt

an empty string, 'code'

and 'x#y'.

Common Misconceptions

"I can just join with a comma and split on commas when decoding." — This works until any string contains a comma. The worker can't tell whether a comma is a package separator or part of the contents. Length-prefix encoding bypasses this entirely: the worker reads by count, so the contents are never parsed for structure.

"The # separator could still appear inside a string and confuse the decoder." — It can't, because the decoder never scans for # inside the content region. After reading a label and computing len, the decoder advances the cursor by exactly len characters. The cursor lands on the next label's first digit — and a label always starts with digits, never #. The only # the decoder ever searches for is the one that terminates the current label.

"The decoder should find # by searching the entire remaining belt." — This is correct in combination with reading only up to the first # from the current position. indexOf('#', pos) finds the # that terminates the current label because pos always points to the start of a numeric label — the first # to the right of pos is always the label terminator, not a # inside some package's contents.

"An empty string in the list is a problem because there's nothing to put after the #." — An empty string simply gets the label 0#. The encoder stamps 0# on the belt; the decoder reads count=0, skips #, takes zero characters, and records "" on the clipboard. Empty strings work exactly the same as any other string.

Complete Solution

Loading editor...

← 238. Product of Array Except Self