How the Internet Works 9: HTML

In today’s post I want to demystify the term HTML. HTML stands for HyperText Markup Language, and it is the language that web pages are written in. A markup language, according to Wikipedia, is a way to “annotate the text” to provide some additional information.

Every webpage has HTML, which includes a set of “tags” to describe different parts of the page. A tag can be something like <a>, <img>, <html>, <p> and many more.

Here is a simple HTML page:

<HTML>
  <HEAD>
    <TITLE>My Page</TITLE>
  </HEAD>
  <BODY>
       <H1>Welcome to my page!</H1>
  </BODY>
</HTML>

There are a few things you can notice here. Tags start like this: <NAME>, and end like this: </NAME>. Tags can be nested inside other tags. The outermost tag is the <HTML> tag, which says this is an HTML page. There is a <TITLE> tag, which says the title of this page is “My Page”. There is an <H1> tag, which says that there is a big header on the page that reads “Welcome to my page!”

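The indenting shows how tags are related to each other. A tag nested inside another tag is said to be a “child” of the tag that encloses it, so <TITLE> is a child of <HEAD>. The indenting is only there for human readers; the nesting of the tags themselves is what defines the structure. The structure of an HTML page is said to form a “tree.”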
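To make the tree idea concrete, here is a small sketch in Python that reads the example page and prints each tag indented by how deeply it is nested. It uses the standard library's html.parser module; the names PAGE, TreePrinter, and tag_tree are my own, made up for illustration. Note that this parser reports tag names in lowercase.

```python
from html.parser import HTMLParser

# The example page from this post.
PAGE = """<HTML>
  <HEAD>
    <TITLE>My Page</TITLE>
  </HEAD>
  <BODY>
    <H1>Welcome to my page!</H1>
  </BODY>
</HTML>"""


class TreePrinter(HTMLParser):
    """Records one indented line per opening tag, based on nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Indent by the current depth, then go one level deeper.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up one level.
        self.depth -= 1


def tag_tree(html):
    """Return the tags of an HTML string as a list of indented lines."""
    printer = TreePrinter()
    printer.feed(html)
    printer.close()
    return printer.lines


print("\n".join(tag_tree(PAGE)))
```

The printed result has html at the top, head and body one level in, and title and h1 one level deeper, which is the same tree the indenting in the example shows.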

If I want to expand my current example to include an image stored in a file named “dog.jpg”, I would write:

<img src="dog.jpg" />

This image tag has a source attribute (src) which tells the browser that the image file is “dog.jpg.” Different tags tell us whether the content should be a list, a header, a paragraph, or some other section of text. Tags can also carry other attributes that determine how they are styled (CSS) or what to do when certain events happen (JavaScript), but these are topics for a later date.
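If you want to see how a program picks out attributes, here is a rough sketch, again using Python's standard html.parser module. The names AttributeCollector and read_attributes are hypothetical ones I chose for this example.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects (tag name, attribute dictionary) pairs for every opening tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; a dict is easier to read.
        self.found.append((tag, dict(attrs)))


def read_attributes(html):
    """Return a list of (tag, attributes) pairs found in an HTML string."""
    collector = AttributeCollector()
    collector.feed(html)
    collector.close()
    return collector.found


for tag, attributes in read_attributes('<img src="dog.jpg" />'):
    print(tag, attributes)
```

For the image tag above, this reports one tag named img with a single attribute, src, whose value is dog.jpg.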

The main point is that webpages are just made up of HTML, which is a series of tags that add additional information to plain text. If you want to write HTML, just open a text editor, copy and paste that first example, and save it as a file called “index.html”. Then you can open that file in a web browser.
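If you would rather let a script do the copying and saving, the steps can be sketched in Python like this. The function names write_page and open_in_browser are my own invention; the sketch only assumes Python's standard pathlib, tempfile, and webbrowser modules.

```python
import tempfile
import webbrowser
from pathlib import Path

# The first example page from this post.
PAGE = """<HTML>
  <HEAD>
    <TITLE>My Page</TITLE>
  </HEAD>
  <BODY>
    <H1>Welcome to my page!</H1>
  </BODY>
</HTML>
"""


def write_page(directory):
    """Save the example page as index.html inside the given directory."""
    path = Path(directory) / "index.html"
    path.write_text(PAGE, encoding="utf-8")
    return path


def open_in_browser(path):
    """Ask the default web browser to open a local HTML file."""
    return webbrowser.open(Path(path).resolve().as_uri())


# Demo: write the page into a throwaway directory and show where it went.
with tempfile.TemporaryDirectory() as demo_dir:
    print("Wrote", write_page(demo_dir))
```

To try it for real, call write_page(".") to save index.html in the current directory and then pass the returned path to open_in_browser.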

How the Internet Works 8: Bytes, Megabytes, and More

I wrote recently about how everything in computers is stored as 0s and 1s, and the language of computers is binary. You can read about that here. However, a single bit, or even a few bits, doesn’t really contain that much information. In general you will be dealing with thousands, or millions, or billions, or even more binary digits!

There is a set of prefixes used with binary data, and here is what they all mean.

First, a bit is just a single 1 or 0.

If you take 8 bits and put them together, we call that a byte.

One thousand bytes is called a kilobyte (shorthand kB). The prefix kilo means 1,000, like kilogram. One wrinkle is that computers naturally work in powers of 2, so although the prefix kilo means 1,000, on a computer a kilobyte often means 1,024 bytes, because 1,024 = 2^10, the closest power of two to 1,000. A small text file on your computer is probably anywhere from a few kilobytes to a hundred kilobytes.

One million bytes is called a megabyte (shorthand MB). The prefix mega means 1,000,000. For a computer the exact number is often the closest power of 2, which is 2^20 = 1,048,576. A standard mp3 file on your computer is probably somewhere from 3 to 10 megabytes (that is 3 to 10 MILLION bytes, or 24 to 80 MILLION 1s and 0s!).

One billion bytes is called a gigabyte (shorthand GB). The prefix giga means 1,000,000,000. For a computer the closest power of 2 is 2^30 = 1,073,741,824. A full-length video of medium quality is probably around a gigabyte in size.
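To see how far apart the “metric” and “power of 2” versions of each prefix are, here is a quick arithmetic sketch in Python. The helper names compare and bytes_to_bits are just ones I picked for illustration.

```python
BITS_PER_BYTE = 8

# (prefix, power of ten, closest power of two)
PREFIXES = [("kilo", 3, 10), ("mega", 6, 20), ("giga", 9, 30)]


def compare(decimal_exp, binary_exp):
    """Return (decimal value, binary value, how much larger the binary one is in %)."""
    decimal = 10 ** decimal_exp
    binary = 2 ** binary_exp
    percent_larger = (binary - decimal) / decimal * 100
    return decimal, binary, percent_larger


def bytes_to_bits(byte_count):
    """Convert a number of bytes to a number of bits."""
    return byte_count * BITS_PER_BYTE


for name, decimal_exp, binary_exp in PREFIXES:
    decimal, binary, percent = compare(decimal_exp, binary_exp)
    print(f"{name}: {decimal:,} vs {binary:,} ({percent:.1f}% larger)")

# A 3 megabyte mp3, counted in individual 1s and 0s.
print(f"{bytes_to_bits(3_000_000):,} bits")
```

The gap grows with each prefix: the power-of-2 kilobyte is about 2.4% bigger than 1,000 bytes, while the power-of-2 gigabyte is about 7.4% bigger than a billion bytes.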

Just for some more reference, loading the Google homepage just now transferred 32.1 kB of data. Loading the home page of thekeesh.com took 440.98 kB, but loading it a second time took only 12.36 kB. An image file is generally anywhere from tens of kilobytes to a few megabytes.

There are more prefixes bigger than giga:

terabyte is one thousand gigabytes
petabyte is one thousand terabytes
exabyte is one thousand petabytes
zettabyte is one thousand exabytes
yottabyte is one thousand zettabytes
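The whole ladder of prefixes can be sketched as a small Python function that turns a raw byte count into a readable size. The name format_bytes and the UNITS list are my own, and the sketch steps by 1,000 by default, matching the list above; pass base=1024 if you want the power-of-2 convention instead.

```python
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def format_bytes(count, base=1000):
    """Format a byte count using the largest prefix that keeps the number under `base`."""
    value = float(count)
    unit = UNITS[0]
    for unit in UNITS:
        # Stop once the value fits, or when we run out of prefixes.
        if value < base or unit == UNITS[-1]:
            break
        value /= base
    return f"{value:.2f} {unit}"


print(format_bytes(440_980))         # the thekeesh.com home page load
print(format_bytes(500 * 10 ** 18))  # 500 exabytes
print(format_bytes(10 ** 12))        # one terabyte
```

Each step up the list is just one more division by 1,000, which is all the loop is doing.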

A quick Google search shows you can buy a terabyte hard drive these days for about $100 or less, which is pretty crazy.

From the Wikipedia page on zettabyte:

As of February 2012, no storage system has achieved one zettabyte of information. The combined space of all computer hard drives in the world was estimated at approximately 160 exabytes in 2006… As of 2009, the entire Internet was estimated to contain close to 500 exabytes. This is a half zettabyte.

So that is a basic introduction to some of the metric prefixes as applied to bytes.

Lightweight Deployment With Git

I’ve used this setup so many times now that I wanted to write up a post explaining how I do it. This method is something I found on this site, but I wanted to add my own commentary.

The idea is that you are developing a website locally using git, and you want to be able to easily push to your live site with git. We will set up a remote repository on your server, and push to it.

Let’s say you have your local repository already set up. Log on to your remote machine. Navigate to a directory where you want to keep your repository. This does not need to be the same location as your code, and you probably want it to be somewhere different.

### On the remote machine
corn03:/afs/ir/group/paperless2> mkdir repo.git && cd repo.git
corn03:/afs/ir/group/paperless2/repo.git> git init --bare
Initialized empty Git repository in /afs/ir.stanford.edu/group/paperless2/repo.git/

The way this works is that we create a post-receive hook: a script that git runs after this repo has received a push. Our hook checks out the current code into a directory that we set as GIT_WORK_TREE. You can put the GIT_WORK_TREE wherever you want. Then we make the script executable. (When typing the script into cat as shown, press Ctrl-D to end the input.)

corn03:/afs/ir/group/paperless2/repo.git> cat > hooks/post-receive
#!/bin/sh
GIT_WORK_TREE=/afs/ir/group/paperless2/cgi-bin git checkout -f
corn03:/afs/ir/group/paperless2/repo.git> chmod +x hooks/post-receive
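A hook can be any executable file, so as an alternative to the two-line shell script, here is a rough sketch of the same hook written in Python. This is not the hook I actually use; it assumes python3 is available on the server, and the names WORK_TREE, checkout_command, and deploy are made up for this example. The work tree path is the one from the shell version.

```python
#!/usr/bin/env python3
import os
import subprocess

# Directory the code is checked out into (same path as the shell hook).
WORK_TREE = "/afs/ir/group/paperless2/cgi-bin"


def checkout_command(work_tree, base_env=None):
    """Build the git command and environment for a forced checkout into work_tree."""
    env = dict(os.environ if base_env is None else base_env)
    env["GIT_WORK_TREE"] = work_tree
    return ["git", "checkout", "-f"], env


def deploy(work_tree=WORK_TREE):
    """Check out the pushed code into the work tree, raising an error if git fails."""
    command, env = checkout_command(work_tree)
    subprocess.run(command, env=env, check=True)


# Git exports GIT_DIR when it runs a hook, so this guard keeps the script
# from checking anything out when it is run by hand outside of a push.
if __name__ == "__main__" and os.environ.get("GIT_DIR"):
    deploy()
```

To use it, you would save this as hooks/post-receive in place of the shell script and make it executable with chmod +x in the same way.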

Now, on your local machine, add the remote you want to push to. On the first push, make sure you name the branch you are pushing, like master.

### Locally
> git remote add web ssh://jkeeshin@corn23.stanford.edu/afs/ir/group/paperless2/repo.git
> git push web master

For any future updates:

> git push web

It’s pretty basic, easy to use, and works for lightweight deployment for a site by yourself or with a few other people.

The only issue I had was one time when my internet connection went down in the middle of a deployment and the git process crashed. The next time I tried to push, nothing happened. After a little bit of searching, I found the cause: a leftover file called “index.lock” had been created, and once we removed that file, pushing worked again.
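If you want to check for that leftover file with a script, here is a cautious sketch in Python. It assumes the lock sits at the top level of the git directory (repo.git for a bare repository, .git for a regular one), which is where git normally keeps index.lock; I did not record exactly where mine was. The function names find_index_lock and remove_index_lock are hypothetical. Only remove the lock when you are sure no git process is still running.

```python
import tempfile
from pathlib import Path


def find_index_lock(git_dir):
    """Return the path of index.lock inside git_dir, or None if there is no lock."""
    lock = Path(git_dir) / "index.lock"
    return lock if lock.exists() else None


def remove_index_lock(git_dir):
    """Delete a leftover index.lock. Returns True if a lock was removed."""
    lock = find_index_lock(git_dir)
    if lock is None:
        return False
    lock.unlink()
    return True


# Demo on a throwaway directory standing in for a git directory.
with tempfile.TemporaryDirectory() as fake_git_dir:
    (Path(fake_git_dir) / "index.lock").touch()
    print("Lock found:", find_index_lock(fake_git_dir) is not None)
    print("Removed:", remove_index_lock(fake_git_dir))
    print("Lock still there:", find_index_lock(fake_git_dir) is not None)
```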