MIT 6.824: Distributed Systems Lecture 1

Darshan Kadu
4 min read · Feb 7, 2022

Distributed Systems is one of the most discussed topics among backend engineers, apart from Web3. This series summarizes the MIT 6.824: Distributed Systems lectures. I will try to summarize what I learned in those classes and do my best to put it in simpler words.

Why are Distributed Systems so important?

Consider Facebook, Amazon, or Google Search: these websites need to serve and store huge amounts of data, but you can't store that much data on a single machine. Hence we need to distribute the data over several computers.

Jobs like MapReduce, where you have to parse large files to do operations like building inverted indexes or counting word frequencies, can't be done on a single machine because of memory, time, and CPU constraints; hence we need distributed systems.

Peer-to-peer file sharing is another example of why we need distributed systems.

Ideally we would like to solve every problem on one computer, but not all problems can be solved that way, for several reasons: we want parallelism to process large files; we want fault-tolerant systems, and therefore run multiple machines in different locations; and we need to serve systems like banks, whose users can be on different continents.

Challenges in Distributed Systems:

Distributed systems have many parts running concurrently, spread over a network. Each part can fail, which makes things complex: you need to handle the failure scenarios carefully, or you will end up with unexpected behavior.

High performance is not easy either: if you add 100 machines, you shouldn't expect a task to run 100x faster, because it is tricky to tune everything together and handle all the bottlenecks.

What customers need from Distributed Systems:

Customers need infrastructure where the distributed details are abstracted away. The main abstractions are:

  • Storage: they don't need to worry about how data is stored across machines; to them it should feel like a single machine is storing the data.
  • Computation: they don't need to worry about the underlying machinery for running tasks like MapReduce, which requires managing a cluster; instead they should just write the logic, and the provider handles all the infrastructure.
  • Communication: things like the network and its reliability.

Performance of Distributed Systems:

Scalability is tricky to handle, i.e., you don't necessarily get 2x speed with 2x machines. Here is why:

Let's say the user request flow looks like:

users → web server → database

With increased traffic we add more web servers, but if all users read from the same database, then every request hits that DB, so the DB becomes the bottleneck. We then need to store the data on multiple machines, which brings its own problems, like consistency.
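One common way to store data across multiple machines is to shard keys across database servers, e.g., by hashing the key. A minimal sketch (the server names here are made up for illustration):

```python
import hashlib

# Hypothetical list of database shard servers (names are illustrative).
DB_SHARDS = ["db-0", "db-1", "db-2"]

def shard_for(key: str) -> str:
    """Pick a shard by hashing the key, so load spreads across servers."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return DB_SHARDS[h % len(DB_SHARDS)]

# The same key always maps to the same shard, while different keys
# spread across all the servers.
print(shard_for("user:42"))
```

Note that this naive modulo scheme reshuffles most keys when a shard is added or removed; real systems often use consistent hashing for that reason.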

Properties in Distributed Systems:

Fault tolerance: a single computer may stay up for its whole lifetime, but with 1000s of machines the probability that some of them fail is high (because of the network, hardware, or some random problem).

Availability: the system/service keeps operating despite failures (e.g., the 2nd replica server fails but the 1st server does the job).

Recoverability: after the system is repaired, it should work as it did at the moment of failure; for example, data shouldn't be lost after recovery, and the system shouldn't behave unexpectedly.

Replication → nothing but multiple servers with identical copies of the data stored on them. Syncing is the main issue with replication: the two servers can be in the same data center or in different ones, and there is a high chance of them being out of sync.

Consistency: let's say we have a key-value (kv) store that supports

put(k, v), get(k) → v

These two operations are simple on a single machine, but not in a distributed system. Several issues can arise: the replicas may be out of sync, or some machine may be dead.
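A toy sketch of the out-of-sync problem, with two in-memory dicts standing in for the replicas (this is illustrative, not how a real kv store is built): if a put reaches the primary but replication is delayed, a get served by the replica returns stale data.

```python
# Two replicas of a key-value store, modeled as plain dicts.
primary = {}
replica = {}

def put(k, v):
    # The write lands on the primary; replication to the
    # second server is delayed (it hasn't happened yet).
    primary[k] = v

def get_from_replica(k):
    # A read routed to the lagging replica.
    return replica.get(k)

put("x", 1)
print(get_from_replica("x"))  # None: the replica is out of sync (stale read)
```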

It is easier to build a system with weaker consistency; strong consistency is hard to build. Both are fine, depending on the use case.

If the replicas exist for fault tolerance, then they should have independent failure probabilities, e.g., by putting them on different continents. But that also increases the lag between the two replicas, which is the cost of strong consistency.

MapReduce (a case study):

Google published a paper on it; please go and read it.

MapReduce starts with an input file (or files). Map functions process each file, or a chunk of it, and produce intermediate outputs, which are taken by reducers to reduce them.

So the input files get split up for "M" map tasks, and their output is then partitioned into files for "R" reduce tasks.

It can look like this for a word-count job:

map(k, v): // v is the file contents
    split v into words
    for each word w:
        emit(w, "1")

reduce(k, v): // v is the list of values emitted for key k
    emit(len(v))
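The pseudocode above can be sketched as a tiny in-memory MapReduce in Python (no real cluster or GFS here, just the map/shuffle/reduce flow; all function names are made up for illustration):

```python
from collections import defaultdict

def map_fn(filename, contents):
    # Map task: emit a ("word", "1") pair for every word in the file.
    return [(w, "1") for w in contents.split()]

def reduce_fn(key, values):
    # Reduce task: all the "1"s for one word arrive together,
    # so the count is simply the number of values.
    return len(values)

def word_count(files):
    # Shuffle: group every emitted value by its key.
    intermediate = defaultdict(list)
    for name, contents in files.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)
    # Run the reduce function once per key (what the R reduce tasks do).
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

print(word_count({"f1": "the cat sat", "f2": "the dog"}))
# {'the': 2, 'cat': 1, 'sat': 1, 'dog': 1}
```

In the real system the intermediate pairs are partitioned across R files rather than held in one dict, but the dataflow is the same.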

The emits are written to files; this is explained more in the GFS paper (GFS is Google's cluster file system).

GFS distributes a big file over many GFS servers. So running MapReduce on such a file is easy: the file is already distributed among the servers, and getting large read throughput is already handled. Please read the MapReduce paper; it's way more interesting :)

That's all I had for this post. I hope you learned something new. If you liked it and want to read more, please share this and subscribe to the newsletter.

If you have any feedback, please feel free to comment or reply to the email.

Originally published at https://darshankadu.substack.com on February 7, 2022.
