Sven Kadak

Content Format

Customizable XML-like language for user-generated content

Every online platform that lets users create posts with rich text content has to settle on a format or notation that describes the structure of a post. The most popular and widespread content format today is HTML, which is used across the entire World Wide Web. Another popular choice is Markdown, which is used, for example, by Stack Overflow and Trello.

Content Format is an XML-like language that is specifically designed for environments that work with user-generated content.

Problems with HTML

Although HTML is the most widely used content format, it is also the most insecure format one could choose for user-generated content. Numerous websites that have used HTML as their content format have been the target of cross-site scripting (XSS) attacks. XSS ranks among the top ten web application security risks and must be taken seriously whenever HTML is used for user-generated content.

Another problem with HTML is that it is a big and complex language. This complexity is part of the reason why XSS exploits are hard to prevent. Additionally, HTML is inconsistent and, in some aspects, unintuitive.
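To make the XSS risk concrete, here is a minimal ReasonML sketch that contrasts embedding user input verbatim with escaping it first. The helpers replaceAll and escapeHtml are hypothetical names introduced for this illustration. They are not part of Content Format, and escaping text alone does not amount to a complete HTML sanitizer.

```reason
/* Hypothetical helper: replaces every occurrence of `search` in `input`. */
let replaceAll = (search: string, replacement: string, input: string): string =>
  input |> Js.String.split(search) |> Js.Array.joinWith(replacement);

/* Hypothetical helper: escapes the characters that are significant in HTML.
   The ampersand is handled first so that later entities are not escaped twice. */
let escapeHtml = (input: string): string =>
  input
  |> replaceAll("&", "&amp;")
  |> replaceAll("<", "&lt;")
  |> replaceAll(">", "&gt;")
  |> replaceAll("\"", "&quot;")
  |> replaceAll("'", "&#39;");

let userInput = "<script>alert(1)</script>";

/* Unsafe: markup supplied by the user becomes part of the page. */
let unsafe = "<p>" ++ userInput ++ "</p>";

/* Safer: the same input is rendered as inert text. */
let safe = "<p>" ++ escapeHtml(userInput) ++ "</p>";
```

Escaping covers plain text only. As soon as users are allowed to write some HTML tags, every tag and attribute has to be checked individually, which is where the size of the language becomes a liability.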

Problems with Markdown

The biggest flaw of Markdown is that, for a long time, it had no official specification. Implementers followed their own interpretation of the syntax, which led to many different flavors with distinct semantics.

Blindly using a random Markdown library also carries a high risk of XSS, because embedding raw HTML is not only allowed but encouraged whenever something cannot be expressed in Markdown's own syntax.

Benefits of Content Format

Drawbacks of Content Format

Example

<*Heading level="1">
  this title is commented out
</*>

<Link to="https://example.com">link</>

<Paragraph>
  Lorem ipsum dolor sit amet, consectetur <Bold>adipiscing</> elit.
  Nulla <Link to="#"><Bold>eu <Italic>orci</></></>, 
  imperdiet ipsum eu, viverra velit. Vestibulum accumsan <Code>orci sed elit 
  venenatis</> malesuada. Quisque eget massa sollicitudin, finibus magna a,
  dapibus nunc.
</>

<YouTube id="YE7VzlLtp-4">

<Code language="rust">
fn main() {
    println!("hello, world!");
}
</>

<Paragraph>Ingredients:</>
<List type="ordered">
  <ListItem>1 <Bold>frozen banana</>
  <ListItem>1 <Bold>Orange</>
  <ListItem>1/2 cup <Bold>almond milk</>
  <ListItem>1/2 cup <Bold>ice</>
  <ListItem>1 teaspoon <Bold>vanilla extract</>
  <ListItem>1 teaspoon <Bold>baobab powder</>
</>

The example above produces the following output:

example.com

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla eu orci, imperdiet ipsum eu, viverra velit. Vestibulum accumsan orci sed elit venenatis malesuada. Quisque eget massa sollicitudin, finibus magna a, dapibus nunc.

fn main() {
    println!("hello, world!");
}

Ingredients:

  • 1 frozen banana
  • 1 Orange
  • 1/2 cup almond milk
  • 1/2 cup ice
  • 1 teaspoon vanilla extract
  • 1 teaspoon baobab powder

ReasonML library usage

To use Content Format, all elements must first be defined. Let's define a Paragraph that, besides text and whitespace, allows only Link and Image elements as children.

Link allows only text and whitespace as children. It has a single attribute, "to", which specifies the URL.

Image is a singleton element that exists on its own and allows no children. It has a single attribute, "source", which specifies the URL.

let elements = Js.Dict.empty();

Js.Dict.set(
  elements,
  "Link",
  ContentFormat.ElementDefinition.makeContainer(
    ~validateAttributes=
      Some(
        attributes => {
          let validAttributes = Js.Dict.empty();
          let to_ = Js.Dict.get(attributes, "to");
          switch (to_) {
          | Some(to_) =>
            Js.Dict.set(validAttributes, "to", to_);
            Some(validAttributes);
          | None => None
          };
        },
      ),
    ~children=
      ContentFormat.ElementDefinition.Node(
        node =>
          switch (node) {
          | Text
          | Whitespace => true
          | _ => false
          },
      ),
    ~render=
      (attributes, children) => {
        let to_ =
          switch (attributes) {
          | Some(attributes) =>
            let to_ = Js.Dict.get(attributes, "to");
            switch (to_) {
            | Some(to_) => to_
            | None => ""
            };
          | None => ""
          };
        let attributes = {j| href="$to_"|j};
        {j|<a$attributes>$children</a>|j};
      },
    (),
  )
  ->ContentFormat.ElementDefinition.Container,
);

Js.Dict.set(
  elements,
  "Image",
  ContentFormat.ElementDefinition.makeSingleton(~render=attributes => {
    let source = Js.Dict.get(attributes, "source");
    switch (source) {
    | Some(source) =>
      let attributes = {j| src="$source"|j};
      {j|<img$attributes>|j}->Some;
    | None => None
    };
  })
  ->ContentFormat.ElementDefinition.Singleton,
);

Js.Dict.set(
  elements,
  "Paragraph",
  ContentFormat.ElementDefinition.makeContainer(
    ~children=
      ContentFormat.ElementDefinition.Node(
        node =>
          switch (node) {
          | ContentFormat.ElementDefinition.Element(name) =>
            switch (name) {
            | "Image"
            | "Link" => true
            | _ => false
            }
          | Text
          | Whitespace => true
          },
      ),
    ~render=(_, children) => {j|<p>$children</p>|j},
    (),
  )
  ->ContentFormat.ElementDefinition.Container,
);

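/* Parse the user-supplied input. `options` and `input` are not shown here
   and must be defined beforehand. */
let document = ContentFormat.parse(options, input);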
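
The Link definition above accepts any value for "to" and interpolates it into href as-is. A stricter validator could be sketched as follows. It assumes that attribute values are plain strings and that a validator receives a Js.Dict.t(string) and returns an option, as the example suggests. The names hasSafeScheme and validateLinkAttributes are hypothetical and not part of the library.

```reason
/* Hypothetical check: accept only absolute http(s) URLs. */
let hasSafeScheme = (url: string): bool =>
  Js.String.startsWith("https://", url)
  || Js.String.startsWith("http://", url);

/* Hypothetical validator with the same shape as the one used for Link above:
   it returns the accepted attributes, or None to reject the element. */
let validateLinkAttributes =
    (attributes: Js.Dict.t(string)): option(Js.Dict.t(string)) =>
  switch (Js.Dict.get(attributes, "to")) {
  /* Reject double quotes as well, since the value ends up inside href="...". */
  | Some(to_) when hasSafeScheme(to_) && !Js.String.includes("\"", to_) =>
    let validAttributes = Js.Dict.empty();
    Js.Dict.set(validAttributes, "to", to_);
    Some(validAttributes);
  | _ => None
  };
```

Under these assumptions, the function could be passed as ~validateAttributes=Some(validateLinkAttributes) in place of the inline validator, so that values such as javascript:alert(1) never reach the render function.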