The InterPlanetary File System (IPFS)

I never had the chance to understand how IPFS – the magical public network where a bunch of Web3 stuff is stored – worked until now. Now that I do, I will try to answer some of the burning questions I had about it, namely:

  1. Who pays for storage in IPFS? Is it free for users or what?
  2. How does the process of storing and retrieving a file actually work?
  3. Does it perform competitively (in terms of price and QoS) with respect to centralized approaches?
  4. Is it truly decentralized?

I reckon this is probably too much to address at once, so I will focus on (1) and (2), referring the reader to [1] for (3) and (4) (spoiler: it performs relatively well and is more decentralized than I would have guessed. The price part is harder to understand).

Who is Paying for This?

IPFS is a network of (computer) nodes, so it follows that the nodes in the network are storing and serving all the data. But that does not mean you can simply push a file and people will store or serve it for you.

Small Content. IPFS indexes content by means of a DHT1. As is usual with DHTs, every file is assigned a key, called a CID in IPFS, which is computed by applying a hash function (e.g. SHA-256) to the file’s content. The whitepaper [2] states that small files (< 1KB) get stored directly in the DHT, which would imply that DHT nodes pay for the storage of small content. It would also imply that small content is guaranteed a very high replication factor (on the order of \(20\) peers, the replication parameter IPFS adopts in its Kademlia-flavoured [3] DHT). Unfortunately, this appears not to be true2, which means that no data ever gets stored in the DHT, only pointers to nodes holding that data. Where does it go then?
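Before answering that, a quick aside on what a CID actually is. The Go sketch below captures the essence of content addressing using a plain SHA-256 digest; real CIDs additionally wrap the digest in a multihash plus a codec/multibase envelope, and toyCID is a made-up name of mine, not an IPFS API.

package main

import (
  "crypto/sha256"
  "fmt"
)

// Toy content addressing: the "key" for a piece of content is a hash of its
// bytes. Real IPFS CIDs wrap the digest in a multihash and add codec and
// multibase prefixes, but the core idea is the same.
func toyCID(content []byte) string {
  digest := sha256.Sum256(content)
  return fmt.Sprintf("%x", digest)
}

func main() {
  a := toyCID([]byte("hello ipfs"))
  b := toyCID([]byte("hello ipfs!")) // one extra byte -> a completely different key
  fmt.Println(a)
  fmt.Println(b)
}

The important property is that the key is derived from the content itself: change one byte and you get a completely different key.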

Content Providers. The key to understanding where data goes in IPFS is that every participant in the network gets to decide which files it wants to store (and serve), and this is true for everyone else too. This means you must either store and serve your files yourself (which is probably not what most people want), or you need to get someone else to do it for you. Getting someone else to do it usually involves compensating them (i.e. paying for it).

Pinning. The process of getting a node to store your content on a permanent basis and serve it to the network is called pinning. IPFS defines a REST API spec for pinning which storage providers can implement. This means that if you want to pin something to IPFS using a pinning service, you will have to: i) publish your data to IPFS yourself; ii) pin it by making a regular HTTP call to a conventional web server, which has a regular DNS domain. Presumably, you must also pay for said service, meaning your HTTP call will probably include an API key or authentication token (not part of the API spec) which you will have to acquire from the provider beforehand.

It is somewhat ironic that such a basic workflow for a decentralized platform like IPFS includes what is essentially a Web2 flow, particularly when this could be built on top of existing Ethereum infrastructure with blockchain payments and a smart contract instead of a REST API. Pragmatism wins, though, and that is probably what the IPFS folks saw as the easiest and simplest approach for now. There is at least one pinning service, however, which claims to be fully decentralized3, but I have not looked into it in depth.

Getting back to the REST API, it defines endpoints for, well, pinning and unpinning objects. A pin request is a POST request with a payload that looks something like:

{
  "cid": "QmCIDToBePinned",
  "name": "PreciousData.pdf",
  "origins": [
    "/ip4/203.0.113.142/tcp/4001/p2p/QmSourcePeerId",
    "/ip4/203.0.113.114/udp/4001/quic/p2p/QmSourcePeerId"
  ],
  "meta": {
    "app_id": "99986338-1113-4706-8302-4420da6158aa"
  }
}

which contains the CID of the data object to be pinned, a name for the pin, a list of origins (peers which the pin provider may contact to fetch the data it is supposed to store and serve), and some arbitrary metadata.
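For concreteness, here is a hedged sketch of what issuing such a request could look like from Go. The /pins path and bearer-token authentication follow my reading of the pinning API spec, while the base URL and token are placeholders you would get from a provider.

package main

import (
  "bytes"
  "encoding/json"
  "fmt"
  "net/http"
)

// pinRequest mirrors the payload shown above.
type pinRequest struct {
  Cid     string            `json:"cid"`
  Name    string            `json:"name,omitempty"`
  Origins []string          `json:"origins,omitempty"`
  Meta    map[string]string `json:"meta,omitempty"`
}

func main() {
  // Hypothetical provider endpoint and token; both come from the provider,
  // not from IPFS itself.
  const endpoint = "https://pinning.example.com/pins"
  const token = "YOUR_API_TOKEN"

  body, _ := json.Marshal(pinRequest{
    Cid:     "QmCIDToBePinned",
    Name:    "PreciousData.pdf",
    Origins: []string{"/ip4/203.0.113.142/tcp/4001/p2p/QmSourcePeerId"},
  })

  req, _ := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
  req.Header.Set("Authorization", "Bearer "+token)
  req.Header.Set("Content-Type", "application/json")

  resp, err := http.DefaultClient.Do(req)
  if err != nil {
    panic(err)
  }
  defer resp.Body.Close()
  fmt.Println("status:", resp.Status) // the provider replies with a pin status object on success
}

If I read the spec correctly, the response carries a status object for the pin which you can poll until the provider reports it as pinned.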

Given that we are already using a REST API to set a pin, it seems somewhat inconvenient that you need to specify a list of origins instead of just pushing the file contents directly in the body of the POST request (or something to that effect). This leaves an important gap in the spec, which providers will probably fill with different, non-interoperable approaches.

How does Storage and Retrieval Work?

At a high level, files in IPFS are stored as a bunch of blocks arranged in a DAG-like structure. Blocks range in size from \(256\)KB to \(1\)MB4. We discuss how data is laid out as DAGs, as well as the process for publishing and retrieving data, next.

Data Layout and Files

The basic data objects in IPFS are, as we already stated, blocks. A block is represented by a structure called IPFSObject:

type IPFSObject struct {
  links []IPFSLink // outgoing edges to other blocks
  data  []byte     // opaque payload carried by this block
}

Blocks contain an arbitrary payload (data) and an array of links:

type IPFSLink struct {
  Name string     // name or alias for this link
  Hash Multihash  // cryptographic hash (CID) of the linked object
  Size int        // cumulative size of the linked object
}

The astute reader may have noticed that those can function as vertices (IPFSObject) and directed edges (IPFSLink), and therefore can be used to represent arbitrary graphs, of which DAGs are a subset.
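To make this more tangible, here is a toy sketch of how these two structs could be assembled into a file DAG: chunk the payload into fixed-size blocks, hash each block, and link them all from a root object. This is not the actual IPFS chunker; Multihash is replaced by a plain SHA-256 hex string and the chunking is deliberately naive.

package main

import (
  "crypto/sha256"
  "fmt"
)

// Stand-ins for the whitepaper structs, with Multihash replaced by a hex string.
type IPFSLink struct {
  Name string
  Hash string // stand-in for Multihash
  Size int
}

type IPFSObject struct {
  links []IPFSLink
  data  []byte
}

// hashOf folds the hashes of an object's children into its own hash,
// together with its payload.
func hashOf(o IPFSObject) string {
  h := sha256.New()
  for _, l := range o.links {
    h.Write([]byte(l.Hash))
  }
  h.Write(o.data)
  return fmt.Sprintf("%x", h.Sum(nil))
}

// buildFileDAG chunks a payload into leaf blocks and links them from a root,
// mimicking (very roughly) how a file becomes a two-level DAG.
func buildFileDAG(payload []byte, chunkSize int) (IPFSObject, []IPFSObject) {
  var root IPFSObject
  var leaves []IPFSObject
  for start := 0; start < len(payload); start += chunkSize {
    end := start + chunkSize
    if end > len(payload) {
      end = len(payload)
    }
    leaf := IPFSObject{data: payload[start:end]}
    leaves = append(leaves, leaf)
    root.links = append(root.links, IPFSLink{
      Name: "", // leaf links are unnamed, much like in the dag output below
      Hash: hashOf(leaf),
      Size: end - start,
    })
  }
  return root, leaves
}

func main() {
  root, leaves := buildFileDAG(make([]byte, 700_000), 256*1024)
  fmt.Println("leaves:", len(leaves), "root hash:", hashOf(root)[:16])
}

Note how hashOf folds the children’s hashes into the parent’s hash; we will come back to why that matters.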

Rather than keep describing the layout in the abstract, let us have a look at a real, somewhat large file in IPFS. The file we are about to inspect represents a public dataset and is about \(1\)GB in size. We begin by retrieving the block that its CID points to, also known as the file’s root object (do try this at home):

> ipfs dag get bafybeiftyvcar3vh7zua3xakxkb2h5ppo4giu5f3rkpsqgcfh7n7axxnsa | jq

{
  "Data": {
    "/": {
      "bytes": "CAE"
    }
  },
  "Links": [
    {
      "Hash": {
        "/": "bafybeibgnyqqkmsoogqh2tfsl64eh7hvztffxjh5f3qnizo56arhfikgim"
      },
      "Name": "nginx-01-02-2021-bank2-sv15.log.tar.gz",
      "Tsize": 970991285
    }
  ]
}

As we can see, the root object contains some data (a byte array with CAE in it, which presumably tells IPFS how this DAG should be interpreted), and a link carrying the file’s name (nginx-01-02-2021-bank2-sv15.log.tar.gz) and the CID of another object. It is interesting to note that the size field in the link (Tsize) shows \(970\,991\,285\) bytes. Since we know blocks can store data only in the ballpark of \(1\)MB, Tsize must refer to the aggregate size of all blocks in the sub-DAG.
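Out of curiosity, that CAE string is just base64 (the unpadded flavour, if I am reading the dag-json encoding right) and decodes to a two-byte protobuf header, which is what makes me read it as UnixFS metadata describing the node:

package main

import (
  "encoding/base64"
  "fmt"
)

func main() {
  // dag-json appears to encode the Data field as unpadded base64.
  raw, err := base64.RawStdEncoding.DecodeString("CAE")
  if err != nil {
    panic(err)
  }
  fmt.Printf("% x\n", raw) // prints "08 01": a tiny protobuf header (field 1, varint 1),
  // which UnixFS uses to tag the node type.
}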

Let us now examine what the linked object (bafy ... fikgim) contains:

> ipfs dag get bafybeibgnyqqkmsoogqh2tfsl64eh7hvztffxjh5f3qnizo56arhfikgim | jq

{
  "Data": {
    "/": {
      # A lot of data...
      "bytes": "CAIYzuP9zgMggIBAIICAQCCAgEAggIBAIICAQCCAgEAggIBAIICAQ ..."
    }
  },
  "Links": [
    {
      "Hash": {
        "/": "bafkreihqypxauakx5gssoyxcpyynxrlxa7meojzwg4vqj6o32lywm32io4"
      },
      "Name": "",
      "Tsize": 1048576
    },
    {
      "Hash": {
        "/": "bafkreidnu3d25o7nawjvfyy7dx4z5qs3tb667dl2ahx32ytnj4laza5yu4"
      },
      "Name": "",
      "Tsize": 1048576
    },

    # ... a very long list of blocks

    {
      "Hash": {
        "/": "bafkreiaoonsu6w4qfzpaa7nuh5kjzlbw6snvvv5h5kzux3pesysduuxfme"
      },
      "Name": "",
      "Tsize": 1048576
    },
    {
      "Hash": {
        "/": "bafkreiazdsmvqjrioya6jm2rjlp5ioas6ykeph2wlccom42fcilqzsva4m"
      },
      "Name": "",
      "Tsize": 1012174
    }
  ]
}

And here things get more familiar: this is basically a long list of links to blocks of about \(1\)MB each. A quick sanity check: \(970\,991\,285 / 1\,048\,576 \approx 926\), so we would expect roughly \(926\) links in this list. The object also carries a Data field of its own which, judging by its very repetitive contents, looks more like per-block bookkeeping than actual file payload.

DAGs are “Merkle” DAGs. Because the hash for each object in the graph is computed over the content of the object itself (Data) as well as the hashes of the objects it links to (.Links[].Hash, in jq notation), the hash of a data object effectively depends on the hashes of its children, making IPFS DAGs “Merkle DAGs”. A degenerate DAG in this case would be a list, which could be used to store a blockchain-like structure on top of IPFS, though I am not aware of any blockchain built this way.
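As a toy illustration of that degenerate case (and not anything IPFS actually ships), a chain of objects where each one links only to its predecessor behaves exactly like a hash chain:

package main

import (
  "crypto/sha256"
  "fmt"
)

// A degenerate Merkle DAG: each entry links to exactly one predecessor,
// yielding a hash chain much like a blockchain header chain.
type chainEntry struct {
  prevHash string
  payload  string
}

func entryHash(e chainEntry) string {
  h := sha256.Sum256([]byte(e.prevHash + e.payload))
  return fmt.Sprintf("%x", h)
}

func main() {
  head := ""
  for _, payload := range []string{"block 0", "block 1", "block 2"} {
    e := chainEntry{prevHash: head, payload: payload}
    head = entryHash(e)
  }
  // Tampering with any earlier payload would change every hash after it,
  // including the head: the same property that protects IPFS DAGs.
  fmt.Println("head:", head[:16])
}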

Publishing Content

Whenever a peer publishes a new file to IPFS, the data is first transformed into a DAG, and a provider record containing the peer’s address is then pushed into the DHT, announcing that the publishing peer is willing to serve that content. As mentioned before, IPFS adopts the original Kademlia [3] replication parameters and publishes provider records to the \(20\) peers closest to the CID (in XOR distance).
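To make “closest in XOR distance” concrete, here is a toy version of the selection, with Kademlia-style IDs shortened to 64 bits (real IDs are much longer, but the metric works the same way):

package main

import (
  "crypto/sha256"
  "encoding/binary"
  "fmt"
  "sort"
)

// Toy Kademlia-style IDs: the first 8 bytes of a SHA-256 digest.
func toyID(s string) uint64 {
  d := sha256.Sum256([]byte(s))
  return binary.BigEndian.Uint64(d[:8])
}

// closestPeers returns the k peers whose IDs are closest to the target CID
// in XOR distance: the peers that would hold the provider record.
func closestPeers(target uint64, peers []uint64, k int) []uint64 {
  sorted := append([]uint64(nil), peers...)
  sort.Slice(sorted, func(i, j int) bool {
    return sorted[i]^target < sorted[j]^target
  })
  if len(sorted) < k {
    k = len(sorted)
  }
  return sorted[:k]
}

func main() {
  var peers []uint64
  for i := 0; i < 1000; i++ {
    peers = append(peers, toyID(fmt.Sprintf("peer-%d", i)))
  }
  cid := toyID("some content")
  fmt.Println(len(closestPeers(cid, peers, 20)), "peers would store the provider record")
}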

Such records must be republished every \(12\) hours, as otherwise DHT nodes holding provider records discard them. This serves two purposes:

  1. it refreshes the set of peers that replicate the record, e.g. when some of the \(20\) original peers have left the network or when new peers that are now closer to the root object’s CID have joined;
  2. it garbage-collects records for peers that no longer wish to serve content.

If content is pinned, then the peers pinning it must republish their provider records for as long as the pin holds. Unpinned content is cached and served until the node runs out of storage, at which point it may be garbage-collected and can no longer be served. It is unclear to me whether peers actively attempt to remove their provider records from the DHT when garbage collection happens, or whether IPFS simply lets the records expire, at the risk of peers being contacted for content they no longer hold.
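The bookkeeping on the DHT side is presumably something along these lines (a sketch with made-up names; only the \(12\)-hour expiry comes from the protocol as described above):

package main

import (
  "fmt"
  "time"
)

// providerRecord is a toy version of what a DHT node might keep: who claims
// to serve a CID, and when the claim was last (re)published.
type providerRecord struct {
  cid         string
  peer        string
  publishedAt time.Time
}

const recordTTL = 12 * time.Hour // records lapse unless republished

// expireRecords drops records older than the TTL: the garbage collection that
// forces publishers to keep republishing while they still serve the content.
func expireRecords(records []providerRecord, now time.Time) []providerRecord {
  kept := records[:0]
  for _, r := range records {
    if now.Sub(r.publishedAt) < recordTTL {
      kept = append(kept, r)
    }
  }
  return kept
}

func main() {
  now := time.Now()
  records := []providerRecord{
    {cid: "bafy...root", peer: "peerA", publishedAt: now.Add(-1 * time.Hour)},
    {cid: "bafy...root", peer: "peerB", publishedAt: now.Add(-13 * time.Hour)},
  }
  fmt.Println(len(expireRecords(records, now)), "record(s) survive") // peerB's record has lapsed
}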

Should we publish provider records for just the root block, or for every single block?

Note that so far we have not said anything about which parts of the Merkle DAG get published as provider records. This is because that decision is left to the publishing peer. By default, however, Kubo (a popular IPFS client written in Go) will index every single CID in the DAG into the DHT.

Retrieving Content

Content retrieval is based on a block exchange protocol named Bitswap [4]. Because DHTs are slow, and because CIDs for child blocks are not necessarily published on the DHT, IPFS relies on a combination of block location strategies and locality assumptions to be able to download a file correctly and with adequate performance.

Base Protocol

A download for a file \(F\) begins with the downloading peer creating an internal – empty at first – session object \(S_F\) which tracks information on peers that are likely to have blocks of \(F\). The downloading peer then broadcasts the root CID of \(F\) to all of its known neighbors (if any) using a WANT-HAVE message. This tells neighbors that the downloading peer wants to have the block with that CID. Known neighbors might be known from past interactions – i.e., the downloading peer may have downloaded blocks from or uploaded blocks to them in the past – or they may come from, say, the DHT bootstrap procedure.

Once the broadcast happens, we have two possible cases:

At least one neighbor has the block. In that case, at least one neighbor should reply with a HAVE message. As soon as a HAVE response arrives, the downloading peer adds the responding neighbor to \(S_F\) and issues a WANT-BLOCK message back, causing the neighbor to respond with the block. Although not explicitly stated, the downloader will likely ask for the block from other neighbors if the first one fails to reply. Similarly, sending the initial WANT-BLOCK to a small subset of neighbors instead of just one may reduce the latency for this initial fetch.

No neighbors have the block. In this case, Bitswap tries a different location strategy, which can include looking up provider records for that block in the DHT. The providers found this way can then be added to \(S_F\), and the peer can retry downloading the block from those.

After acquiring the root block, the downloading peer switches to a round-based mode in which it continues the download by selecting peers from \(S_F\) at random to ask for subsequent blocks, instead of broadcasting requests to its entire neighborhood, which would be untenable in terms of overhead. The random selection is biased to favour nodes which have provided the most blocks for \(F\) (and only for \(F\)) over the download session.
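The sketch below compresses the flow so far into a few lines of Go. All names are mine (the real go-bitswap implementation batches wants, tracks many more statistics, and exposes a rather different API), but the two cases above map onto the two lookup loops in fetchBlock:

package main

import "fmt"

// A highly simplified sketch of a Bitswap-like session.
type peer struct {
  id     string
  blocks map[string][]byte // blocks this peer can serve, keyed by CID
}

func (p *peer) hasBlock(cid string) bool   { _, ok := p.blocks[cid]; return ok }
func (p *peer) getBlock(cid string) []byte { return p.blocks[cid] }

type session struct {
  candidates []*peer // peers believed to hold blocks of this file (S_F)
}

// fetchBlock models the two cases in the text: first a WANT-HAVE broadcast to
// neighbors, then a fallback to the DHT if nobody answers with HAVE.
func (s *session) fetchBlock(cid string, neighbors []*peer, dhtLookup func(string) []*peer) []byte {
  for _, n := range neighbors { // broadcast WANT-HAVE
    if n.hasBlock(cid) { // neighbor replies HAVE
      s.candidates = append(s.candidates, n)
      return n.getBlock(cid) // follow up with WANT-BLOCK
    }
  }
  for _, p := range dhtLookup(cid) { // no HAVEs: ask the DHT for providers
    if p.hasBlock(cid) {
      s.candidates = append(s.candidates, p)
      return p.getBlock(cid)
    }
  }
  return nil
}

func main() {
  provider := &peer{id: "peerA", blocks: map[string][]byte{"bafy...root": []byte("root block")}}
  s := &session{}
  dht := func(cid string) []*peer { return []*peer{provider} }
  block := s.fetchBlock("bafy...root", nil /* no neighbors yet */, dht)
  fmt.Printf("got %d bytes from %d session peer(s)\n", len(block), len(s.candidates))
}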

Note that unless the content is overwhelmingly popular, the first step of a download will typically be a DHT (or DNS) lookup. Also note that if CIDs for child blocks (i.e. blocks that are linked from the root block) are not published to the DHT, then this set of peers must be able to provide the entire file. And even if CIDs for all blocks are published to the DHT5, looking them up one-by-one would be extremely slow.

We can say, then, that Bitswap relies on locality to a large extent: it assumes that peers holding the root CID also hold the rest of the file, as otherwise it will either not work or work very slowly.

Neighborhood Expansion and Maintenance Strategies

Because avoiding the DHT is important, Bitswap implements strategies that attempt to increase the set of eligible peers in \(S_F\) over time. We briefly touch upon those next.

Peer Sampling with TTL. Although Bitswap does not implement explicit peer sampling [5], it does implement something analogous to it by means of a TTL (time-to-live) field on WANT-* messages. If the TTL is set to \(\geq 1\) by the sender, then neighboring peers which receive the message will forward it to their neighbors as well, effectively performing a biased random walk over \(n\) hops (where \(n\) is the value of the TTL). This is coupled with a parameter \(d\) which tells the peer what fraction of its neighbors it should forward the message to. Peers which have the block will end up in \(S_F\), meaning we effectively have a sampling method, based on random walks, that picks peers from the connected component in which the downloading peer participates.
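A rough sketch of that forwarding rule, again with invented names and a recursive walk in place of real message passing:

package main

import (
  "fmt"
  "math/rand"
)

type node struct {
  id        string
  neighbors []*node
  hasBlock  bool
}

// forwardWant models TTL-based forwarding of a WANT message: each hop relays
// the request to a fraction d of its neighbors until the TTL runs out, and
// nodes holding the block are collected as candidates for the session.
func forwardWant(n *node, ttl int, d float64, seen map[string]bool, found *[]*node) {
  if seen[n.id] {
    return
  }
  seen[n.id] = true
  if n.hasBlock {
    *found = append(*found, n)
  }
  if ttl <= 0 {
    return
  }
  fanout := int(float64(len(n.neighbors)) * d)
  if fanout < 1 && len(n.neighbors) > 0 {
    fanout = 1
  }
  perm := rand.Perm(len(n.neighbors))
  for _, idx := range perm[:fanout] {
    forwardWant(n.neighbors[idx], ttl-1, d, seen, found)
  }
}

func main() {
  // A tiny line topology: a -> b -> c, with only c holding the block.
  c := &node{id: "c", hasBlock: true}
  b := &node{id: "b", neighbors: []*node{c}}
  a := &node{id: "a", neighbors: []*node{b}}

  var found []*node
  forwardWant(a, 2, 1.0, map[string]bool{}, &found)
  fmt.Println("peers found via TTL forwarding:", len(found))
}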

WANT message inspection. WANT messages are inspected opportunistically by peers, which then construct a CID-to-peer mapping table referred to as the peer-block registry. The peer-block registry can be checked for peers that are known to have asked for a CID in the past, effectively making it a source of potential peers for \(S_F\). Indeed, in [4] the authors state that Bitswap sessions will first attempt to send direct WANT-BLOCK messages to \(n_{pb} = 3\) peers from the peer-block registry, and only fall back to other strategies if those peers respond negatively to the request.
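The registry itself is conceptually just a map from CIDs to the peers recently seen asking for them; a minimal sketch (field and method names are mine):

package main

import "fmt"

// peerBlockRegistry records which peers were seen asking for which CIDs.
// A peer that wanted a block recently is a good guess for who has it now.
type peerBlockRegistry struct {
  byCID map[string][]string
}

func (r *peerBlockRegistry) observeWant(cid, peer string) {
  if r.byCID == nil {
    r.byCID = map[string][]string{}
  }
  r.byCID[cid] = append(r.byCID[cid], peer)
}

// candidates returns up to n peers previously seen wanting this CID; per [4],
// sessions try n_pb = 3 such peers before falling back to other strategies.
func (r *peerBlockRegistry) candidates(cid string, n int) []string {
  peers := r.byCID[cid]
  if len(peers) > n {
    peers = peers[:n]
  }
  return peers
}

func main() {
  var reg peerBlockRegistry
  reg.observeWant("bafy...block", "peerA")
  reg.observeWant("bafy...block", "peerB")
  fmt.Println(reg.candidates("bafy...block", 3))
}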

Pruning. A peer in \(S_F\) is dropped whenever it fails to provide useful blocks for more than a certain number (\(16\), again according to [4]) of requests.

And this largely concludes my exploration of IPFS for now. I will probably revisit these subjects in more detail as I work on my own project, particularly the shortcomings of the DHT and Bitswap, and how they compare to protocols like BitTorrent.

[1]
D. Trautwein et al., “Design and evaluation of IPFS: A storage layer for the decentralized web,” in Proc. of ACM SIGCOMM ’22, 2022.
[2]
J. Benet, “IPFS - content addressed, versioned, P2P file system,” CoRR, vol. abs/1407.3561, 2014.
[3]
P. Maymounkov and D. Mazières, “Kademlia: A peer-to-peer information system based on the XOR metric,” in Proc. of the 2002 IPTPS, 2002, pp. 53–65.
[4]
A. de la Rocha, D. Dias, and Y. Psaras, “Accelerating content routing with Bitswap: A multi-path file transfer protocol in IPFS and Filecoin,” Protocol Labs Research Report.
[5]
M. Jelasity, R. Guerraoui, A.-M. Kermarrec, and M. van Steen, “The peer sampling service: Experimental evaluation of unstructured gossip-based implementations,” in Proc. of Middleware ’04, 2004.

  1. Which the whitepaper [2] states is based on S/Kademlia, but which is not.

  2. Indeed, I wonder if reading the whitepaper these days helps or hurts – a lot of stuff got dropped or was never implemented at all.

  3. https://medium.com/aleph-im/pioneering-decentralized-ipfs-pinning-services-fe31d9d5817b

  4. Though, as with most things in IPFS, it is not entirely clear that this is the limit, with some people claiming to have seen maximum block sizes of up to \(4\)MB (as of 2021).

  5. Which seems to be the default behavior for Kubo.
