Inquisitor – A Footprinting Approach

Open Source Intelligence (OSINT), or more precisely, the use of open source intelligence sources to profile the internet exposure of organisations (i.e. footprinting), is one of my favorite areas in information security particularly because it’s more or less of an open-ended problem at the moment.

There are a lot of interesting things that you can do with stuff that you can find in the open. For example, if you can manage to get your hands on a list of IP blocks that belong to an organisation that you’re doing a penetration test for, you can do a sweep on it and check for live hosts. Another example is, if you can manage to get your hands on a list of email addresses that belong to an organisation, you can iterate over it and cross-check each email address with a service such as HaveIBeenPwned in order to find various accounts belonging to your target organisation’s personnel that might have been compromised before.

A thing to note about open source intelligence is that it’s not just a practice for asset discovery and neither is it a practice that’s restricted to information security altogether. I’d personally describe it as a practice of collecting publicly available information from various sources, and analyzing the collected information in order to create a model which can be used to make decisions. Businesses and investors also make use of open source intelligence in order to perform competitive and market analysis among other things.

Table of Contents

OSINT and Information Security

[ Back to Table of Contents ]

In the context of information security, open source intelligence can either be used offensively or defensively.

Defensive Use: Threat Forecasting

[ Back to Table of Contents ]

I would say that a potential defensive application of OSINT is threat forecasting. You might be able to do this by monitoring chatter on various online platforms and checking for red flags in order to forecast potential threats against your organisation. For example, you can monitor PasteBin for password leaks that might put your organisation at risk.

You’re probably going to need to do some pretty advanced stuff involving machine learning and natural language processing if you plan on automating this practice because you’re inevitably going to have to sort through a lot of unstructured information when dealing with chatter expressed in human language.

If I’m going to build an automatic threat forecasting system, I’d probably gather some information on what constitutes a red flag first and then determine what sort of metrics I can extract from that so that I’d have something to feed into a machine learning algorithm later on.

Perhaps I can gather a bunch of Facebook and Twitter posts threatening my organization and then create a bag-of-words or an n-gram model from them which I can then use to train a Naive-Bayes classifier to detect other posts that threaten my organization. This is similar to how spam filters work by the way.

Once I feed my metrics into a machine learning algorithm, it will produce a classifier, which, in turn, will allow me to automatically determine red flags in the future. Of course, this classifier will not be completely accurate and will have to be corrected with additional data as the system becomes more mature.

Alternatively, you can go for a sort of emphatic scoring system. In this approach, you try to determine how threatening a message is based on the words that it contains.

Let’s say you’re a cereal manufacturer and there’s some online chatter about you containing the words “exploit” or “vulnerability” or “leak” or some other technical words that don’t blend well with breakfast-related products. If you see something like that, that might indicate some sort of cyber threat against your organisation.

Offensive Use: Footprinting

[ Back to Table of Contents ]

As for potential offensive applications of OSINT: you can use it for footprinting as I earlier discussed in the opening paragraph of the article.

This is useful when you’re in the reconnaissance phases of penetration testing as you can leverage on the information provided by footprinting when performing other reconnaissance tasks such as scanning and enumeration. For example, as described in the opening section of this article, you can subject IP address ranges that you find via footprinting to port scans in order to find live hosts in your target’s networks.

You’ll also find the results of footprinting to be pretty useful in other non-reconnaissance-related penetration testing tasks such as when conducting vulnerability scans and performing exploitation tasks. Footprinting essentially allows you to have a map of your target organization’s assets, hence, you get a pretty good idea where to start probing for vulnerabilities.

Existing Footprinting Solutions

[ Back to Table of Contents ]

While tools such as Maltego and recon-ng do exist, as far as I know, they don’t automatically determine whether assets belong to your target or not. You have to manually sift through the acquired assets and pick out which ones belong to your target and which ones do not.

If you’re doing a penetration test for a client and you accidentally attack the wrong server, there might be a chance that you end up behind bars or paying some really large fine. Either way, the results aren’t going to be pretty so picking out the correct targets for attack is pretty important.

Also, for some of the tools mentioned above, you may also need to manually invoke the transformation directive of each of the assets that you have discovered, hence, adding additional work onto the plate of their operators.

Inquisitor

[ Back to Table of Contents ]

This project is located at https://github.com/penafieljlm/inquisitor.

This is my personal take on the footprinting problem though it is heavily inspired from how other existing footprinting tools (Maltego and recon-ng) operate. This approach includes additional goals that aren’t usually covered by your typical everyday footprinting solution such as automatically determining whether or not a specific asset belongs to a penetration tester’s target organization.

I’m gonna go ahead and point out that not everything can be automatically classified and we’re still going to need some guidance from the operator to ensure that the assets and their classifications are accurate in the end, but our goal is to minimize the effort required of the operator and make it easy and efficient for them to accurately find and classify assets.

Difference From Other Solutions

[ Back to Table of Contents ]

Rather than simply encapsulating OSINT sources, tools, and transforms, Inquisitor encapsulates an entire footprinting approach altogether. It’s an opinionated tool. It has opinions on what sort of semantics each asset type follows, how certain asset types are to be transformed, what the logical relationships between asset types are, how each asset type is identified, what attributes about an asset should be recorded, how assets should inherit classification labels from other assets, among other things.

Inquisitor is also an organization-centric tool. It’s not a general reconnaissance tool like recon-ng or Maltego. It revolves around the goal of profiling the internet exposure of organizations and not individual users or some other entity types, so expect that the “opinions” built into Inquisitor were considered under that context.

Finally, Inquisitor also seeks to determine which assets actually belong to the target organisation. This will be discussed in more detail below.

Goals of Inquisitor

[ Back to Table of Contents ]

The main goal of Inquisitor is to easily and accurately profile the internet exposure of target organisations. That means we want little if not zero false positives from our results (false negatives are fine). We want to avoid attacking the wrong assets later on when we get to the other steps of penetration testing. This is especially useful when you’re trying to automate vulnerability assessments for an entire organisation from the outside.

In order to achieve the main goal, I layout the following sub-goals for Inquisitor:

Automatic Asset Classification

[ Back to Table of Contents ]

As much as possible, if an asset can automatically be classified with one-hundred percent certainty, it should be done.

For example, if the registrant name of a domain has been manually classified as belonging to the target organisation, then the domain’s ownership classification should inherit that. If the registrant name has not been manually classified, then we can defer to inheriting the classifications of its parent domains.

The example above only applies for domain name assets though. A different asset type would require a completely different classification inheritance logic. As mentioned previously, Inquisitor is an opinionated tool and opinionated approaches to asset classification inheritance have also been built into it.

Easy Manual Inspection and Classification of Assets

[ Back to Table of Contents ]

In cases where we are not able to automatically classify assets, we should be able to easily determine the non-classified assets that are likely to belong to the target organisation by using some sort of heuristic approach. A possible approach would be to quantify how similar a specific asset’s attributes are to the attributes of the assets that are known to belong to the target organisation and then ordering the non-classified assets from most likely to belong to the target organisation to least likely to belong to the target organisation before being presented to the user for manual classification.

The logic to quantify an asset’s divergence from the target organisation’s known owned assets is going to depend on what type of asset is being handled, and in the context of Inquisitor, it’s going to be implemented in an opinionated form as well.

Automatic and Appropriate Asset Transformation

[ Back to Table of Contents ]

Of course, as with any reconnaissance tool, we’d want to be able to apply the appropriate kind of transformation logic to every asset type that we want to handle and we would want to be able to do this automatically for assets that are known to be owned by the target organisation.

The transformation logic, would, of course, vary between the different types of assets that we want to handle. For example, you can’t transform an IP Block the same way that you would transform a domain name because the semantics surrounding them are different. Let’s take for example the transformation directive “retrieve the  subdomains of a domain”. One of the approaches that you could take to implement this directive is to go on Google and enter host:coca-cola.com. That sort of transformation directive will not work on, say, an IP Block because the query host:52.0.0.011 simply does not make sense. An IP block has to be handled under its own semantics and so does every other asset type.

Non-Use of Enumeration Techniques

[ Back to Table of Contents ]

Lastly, since Inquisitor is a footprinting tool (i.e. it attempts to find target assets in a non-“intrusive” way), we want to avoid employing exhaustive enumeration techniques on our target’s assets.

Sure, we might get more thorough results if we do a complete sweep of our target’s IP ranges, but chances are, our attempts to probe their network will not go unnoticed. One of the reasons that we’re using open sources is to keep our target from knowing about our reconnaissance maneuvers, so that’s why we would want to avoid hammering them with probe requests.

Other than that, enumeration techniques typically fall under the scope of the scanning phase of penetration testing, so it’s really not our concern here, but note that footprinting and scanning go hand-in-hand when profiling the entire internet-facing infrastructure of an organization.

Footprinting Concepts

[ Back to Table of Contents ]

Now that the goals for Inquisitor have been defined, let’s go over some basic footprinting concepts that you need to know in order to understand how Inquisitor works.

Seeding

[ Back to Table of Contents ]

Before Inquisitor starts looking for the assets of our target organisation, we first need to give it some information that we already know about our target. Inquisitor can’t just characterize organisations without any information to start off from, so we’re going to have to give it some hints before it starts running. If we don’t seed Inquisitor, we’re not going to get anywhere.

Anyway, the process of “seeding” can basically be described as enumerating “facts” that we know about our target prior to conducting transforms. In the context of Inquisitor, facts include, “the target organisation registers some of its assets under the name The Coca-Cola Company” or “the target’s primary website is coca-cola.com” or “the network 208.53.200.104/29 belongs to our target organisation”.

I believe the concept of seeding is pretty much the same when using Maltego or recon-ng minus the asset classification label portion of the initial information provided to the tool.

Canonicalization

[ Back to Table of Contents ]

The process of canonicalization refers to expressing the various representations of a specific asset into a “standard” form which is also referred to as its “canonical form”.

For example, in the case of discovering domains names of a target organisation, we might acquire two domain names represented as ᴡᴡᴡ.Coca-Cola.com and ᴡᴡᴡ.coca-cola.com. It’s pretty obvious that these two domain names refer to the same domain, but when you compare them using an equality operator, the results will indicate that they are not the same because one has uppercase letters and the other does not. This is where canonicalization comes in. If we canonicalize both the ᴡᴡᴡ.Coca-Cola.com and ᴡᴡᴡ.coca-cola.com domains, we’ll end up with the ᴡᴡᴡ.coca-cola.com and ᴡᴡᴡ.coca-cola.com domain names respectively. When both domain names are expressed like this, equality operators will have no problem determining that these two domains are one and the same.

This works only for the domain name asset type however. Different asset types will demand a different canonicalization method depending on the semantics surrounding them.

Transforms

[ Back to Table of Contents ]

The process of performing transforms can basically be described as looking up information that we have yet to know about the target based on the information that we already know about the target.

If we know that the domain coca-cola.com belongs to our target, then we can perform a bunch of operations which will help us enumerate more assets that might belong to our target based on that fact.

For example, we can perform a site:coca-cola.com query on Google which will give us a list of domain names under coca-cola.com such as dunkinanytime.coca-cola.com. Note that in order to perform this query, we needed to know that coca-cola.com belonged to our target first.

Inference

[ Back to Table of Contents ]

The process of inference simply involves inferring the existence of other assets based on the fact that a specific asset and its metadata exists. Inference requires no additional lookup to be performed, and instead, other assets are directly derived from an asset and the metadata attached to it (such as whois information).

For example, if we know that ns1.coca-cola.com exists, we can infer that its parent domain coca-cola.com exists without performing any sort of lookup whatsoever. Another example would be, if the email admin@coca-cola.com exists, then we can directly infer that the coca-cola.com domain exists.

Like transforms, the process of inference depend heavily on the semantics of the asset type being subjected to the inference process.

Classification

[ Back to Table of Contents ]

The classification process is where we determine whether or not an asset belongs to our target organisation.

There are three classification labels: acceptedrejected, and unclassified.

The accepted classification label indicates that an asset is owned by the target organization. The rejected classification label indicates that an asset is not owned by the target organization. The unclassified classification label indicates that an asset hasn’t been classified as either accepted or rejected yet.

The classification process depends heavily on the semantics surrounding the asset type being subjected to classification especially if automatic classification is being attempted.

In Inquisitor, assets may be classified automaticallymanually, or via a rule. Manual classifications take precedence over rules-based classifications, and rules based classifications take precedence over automatic classifications.

Automatic Classification

[ Back to Table of Contents ]

In Inquisitor, automatic asset classification operates by inheriting the classifications of related assets. As stated previously, this process depends heavily on the semantics surrounding the asset type being subjected to classification.

For example, a domain can automatically classify itself based on the classification of its parent domain or its registrant name – that is, if either the registrant name The Coca-Cola Company or the domain name coca-cola.com is owned by our target organisation, then we know that ns1.coca-cola.com is also owned by our target organisation.

Naturally this process differs between different asset types. For example, email addresses don’t have registrant names and must defer to the ownership classification of their domains.

Manual Classification

[ Back to Table of Contents ]

If an asset cannot be automatically classified or if an incorrect classification was made, then the user can manually classify or re-classify it.

The manual assignment of an asset to a specific classification takes precedence over all other classification methods. That means that if the domain coca-cola.wordpress.com was manually classified as accepted, it will always be classified as accepted even if its registrant name or parent domain were classified as rejected, and even if a rule rejecting domains matching *.wordpress.com is present.

Assets with manual classifications actually serve as starting points for the whole classification inheritance process described above. The whole concept of the seeding process is actually based upon manually introducing and classifying assets that aren’t recorded on the intelligence database yet.

The more assets that you manually classify, the better the coverage and accuracy of your results will be since the system has more “facts” to infer conclusions from. It’s sort of like machine learning in a sense.

Rule-Based Classification

[ Back to Table of Contents ]

Rule-based classification allows for the definition of accept/reject rules that assets matching a specific criteria will inherit. For example, one can declare that all domains under amazon.com be automatically rejected via the rule reject domain.name~=.*\.amazon\.com.

Just imagine an access control list, but for Inquisitor assets.

Acceptability

[ Back to Table of Contents ]

The process of determining asset acceptability aids the process of manual classification by allowing potentially owned assets to be sorted according to their similarity with assets that are known to belong to the target, thus allowing assets that are most likely to belong to the target to be evaluated and manually classified first.

Like some of the concepts introduced above, the process of determining the acceptability rating of an asset heavily depends on the semantics of the asset type being subjected to acceptability rating quantification.

For example, one of the approaches that can be taken to quantify the acceptability rating of a certain domain name is to check if its zones contain substrings that are common among domains that have been classified as accepted. If many domains that have been classified as accepted contains an elevated instance of the substrings coca and cola, and the domain we’re evaluating contains those two words, then that should raise the acceptability rating of the domain that we are currently assessing.

Parental Relationships

[ Back to Table of Contents ]

For the purpose of visualising the assets that were found during OSINT analysis, it is required that assets have the ability to know which other asset best serves as their “parent”.

The process of determining which parent asset a specific asset belongs to is entirely dependent on the semantics surrounding the asset in question.

For example, domain names and network blocks can be registered under a specific registrant name, making that registrant name effectively the “parent” of those two assets. A host can also exist under another domain name such as in the case of ᴡᴡᴡ.coca-cola.com and coca-cola.com, etc. If a specific asset has no parent, such as in the cases of registrant names, they directly fall under the target organisation itself.

Types of Assets

[ Back to Table of Contents ]

While there maybe countless types of assets that can potentially be subjected to OSINT discovery, the initial scope of Inquisitor includes only the most common “internet objects” for now. These include internet objects such as registrant names, network blocks, domain names, email addresses, and social media accounts. Each will be discussed individually below.

Registrant Names

[ Back to Table of Contents ]

Registrant names are the names that organisations use to register their assets to various registries and authorities.

For example, when reserving a network block on an internet registry such as ARIN, an organisation may use names such as The Coca-Cola Company in their request for a network block. Once the request is approved, ARIN binds that registrant name to the network block that has been allocated to the organisation in question.

Organisations tend to use the same registrant names when registering for other things such as domain names and X.509 certificates, so it’s a pretty useful determinant of asset ownership. Additionally, the authorities which manage network blocks, domain names, and X.509 certificates also verify the entities behind registrations thus allowing us to have more confidence in the authenticity of the relationship between an asset and a registrant name.

Canonicalization

[ Back to Table of Contents ]

To canonicalize registrant names:

  1. Strip leading and trailing spaces
  2. Remove extra spaces between words
  3. Perform unicode transliteration
  4. Convert to uppercase

A canonicalised registrant name should look something like THE COCA-COLA COMPANY.

Inference

[ Back to Table of Contents ]

There are no assets that you can infer from the registrant name alone.

Transforms

[ Back to Table of Contents ]

To discover assets related to a registrant name:

  1. Internet Registry Transform
    • Sources
      • ARIN
      • RIPE-NCC
      • APNIC
      • LACNIC
      • AFRINIC
    • Inputs
      • Registrant Name
    • Outputs
      • Registrant Names
      • Network Blocks
      • Email Addresses
  2. Google Transform
    • Sources
      • Google
    • Inputs
      • [registrant-name]
      • site:facebook.com [registrant-name]
      • site:linkedin.com [registrant-name]
      • site:twitter.com [registrant-name]
      • site:instagram.com [registrant-name]
      • site:github.com [registrant-name]
    • Outputs
      • Domain Names
      • Email Addresses
      • Social Media Accounts
  3. Shodan Transform
    1. Sources
      • Shodan
    2. Inputs
      • [registrant-name]
      • org:"[registrant-name]"
    3. Outputs
      1. Registrant Names
      2. Domain Names

Classification

[ Back to Table of Contents ]

If the registrant name has not been manually classified as belonging to the target organisation, then it does not. Registrant names have no means of inheriting classification labels from other assets.

Parent

[ Back to Table of Contents ]

Registrant names are top-level objects and they directly fall under the target organisation.

Acceptability

[ Back to Table of Contents ]

The computation of the acceptability rating of an unclassified registrant name involves removing all unicode transliterated NLTK stopwords from the registrant names to be compared and then calculating the maximum Needleman–Wunsch similarity score between all accepted registrant names and the unclassified registrant name being rated.

This GitHub project provides a good implementation of the Needleman–Wunsch algorithm. The best part about it is that it outputs an identity score that we can use directly.

Here’s a sample of the algorithm in action:

>>> alignment.needle('COCA-COLA COMPANY', 'COCA-COLA HELLENIC BOTTLING COMPANY') 
Identity = 48.571 percent
Score = 80
COCA-COLA------------------ COMPANY
COCA-COLA COMPANY
COCA-COLA HELLENIC BOTTLING COMPANY
>>> alignment.needle('COCA-COLA COMPANY', 'PEPSICO, INC')
Identity = 23.529 percent
Score = -25
COCA-COLA COMPANY
 CO N 
PEPSICO-, ----INC
>>>

Post-Footprinting

[ Back to Table of Contents ]

Registrant names don’t really have a use outside of footprinting or scanning.

Network Blocks

[ Back to Table of Contents ]

Network blocks are essentially ranges of IP addresses that are reserved by an internet registry for the use of a specific organisation. These IP blocks can serve as targets for scanning in subsequent phases of penetration testing.

Note that not all hosts under an IP block are automatically owned by the organisation owning the IP block. Some organisations decide to delegate some sub-ranges of their blocks to other organisations, hence, it’s possible that some hosts under an IP block may be out of scope for your penetration testing project. Usually, in these cases, the registrant name of the most immediate IP block parent of an IP address can be checked to determine whether the IP address in question does indeed belong to your target organisation or not.

There’s another thing that needs to be noted when it comes to ownership of IP addresses especially when it comes to PaaS and IaaS providers, and that is: the fact that an organisation owns an IP address does not immediately mean that it owns the services running on that IP address.

Take for example, Amazon Web Services (an IaaS provider). While you can rent out one of their machines to host your organisation’s business applications, the IP that will be allocated to the machine that you’re renting will still fall under Amazon’s IP address range. Technically the IP address belongs to Amazon, but clearly, the service that’s running on it belongs to you. You need to keep this in mind if you wanna avoid unintentionally going out of scope during penetration testing. However, you won’t usually encounter this problem unless the organisation that you’re doing footprinting for offers a service delegating its IT resources to third parties (such as Amazon).

Whois information should be attached to a network block object upon initialisation.

Canonicalization

[ Back to Table of Contents ]

Network blocks should be expressed in CIDR format. If you’re implementing canonicalization in code, you’d probably be better off using some third party package which parses CIDR strings such as the netaddr package in Python.

A canonicalized network block should look something like 52.0.0.0/11.

Inference

[ Back to Table of Contents ]

From the whois information attached to the network block object upon initialization, we may be able to infer the following:

  1. Registrant Name
  2. Network Blocks
    1. Parent Network Blocks
    2. Child Network Blocks
  3. Email Addresses
    1. Administrative Contact
    2. Abuse Contact
    3. Tech Contact

Transforms

[ Back to Table of Contents ]

To discover assets related to a network block:

  1. Shodan Transform
    • Sources
      • Shodan
    • Inputs
      • net:[network-block]
    • Outputs
      • Registrant Names
      • Domain Names

Classification

[ Back to Table of Contents ]

If the network block has not been manually classified as belonging to the target organisation, then it defers to the classification label of the registrant name associated with it.

We don’t inherit the classification labels of parent network blocks because that allows for false positives to seep into the classification inheritance process. For example, if your target organisation delegates network blocks to other organisations (e.g. Verisign), then everyone you delegate network blocks to will inherit the classification label of the “master” network block that you’re dividing for everyone. That will be very bad especially if you’re planning to do scanning and enumeration later on since you’re treating all networks blocks under the “master” network block as if they were owned by your target organisation, when in fact, they’re not.

Additionally, the whole point of having subnetworks is to have the ability to delegate the administration and/or ownership of a portion of a network to another entity, hence, inheriting the classification label of parent network blocks simply makes no sense. Although, the practice of dividing network blocks into smaller network blocks for the same organization exists, such practice occurs primarily within the context of private IP address spaces.

Parent

[ Back to Table of Contents ]

A network block can fall under other larger network blocks that are owned by the target. The smallest other target-owned network block that is able to contain the network block in question becomes its immediate parent. If no other network blocks can contain the network block in question, then the target-owned registrant name associated with it becomes its parent. If no registrant name is available, it becomes a top-level object.

Acceptability

[ Back to Table of Contents ]

There is no known method to determine the acceptability rating of a network block so far, but you won’t be manually classifying network blocks anyway as I will discuss later on, so I don’t think it will be that big of a deal.

Post-Footprinting

[ Back to Table of Contents ]

The use of network blocks beyond footprinting mostly revolve around subjecting them to various scanning techniques in order to discover new assets that footprinting was not able to find.

Ping Sweeps, Port Scans, Version Scans, and Vulnerability Assessments

[ Back to Table of Contents ]

You can perform port scans and ping sweeps of entire network blocks in order to find live hosts and services.

Upon finding live services, you may subject them to version scanning which detects vulnerable versions of network services off the bat. You’d want to leverage these vulnerable network services later on once you reach the penetration testing step where you try to actually break into the target’s internal corporate network.

You may also want to subject detected web applications to further testing. This would include testing for vulnerable versions of the web server being used, vulnerable versions of SSL, vulnerable versions of the Content Management System being used, vulnerable versions of the framework being used, and other common web application vulnerabilities (see the OWASP Top 10 project).

Domain Names

[ Back to Table of Contents ]

Domain Names are essentially strings defining a zone of administrative authority. By this, we mean that every name that falls under a specific domain name is administered by the organisation owning the domain name in question. These names include subdomains and hostnames (i.e. domain names that translate to IP addresses).

The collection of domain names that one would end up with after the footprinting process can serve as starting points for scanning and enumeration when one arrives in the subsequent phases of penetration testing.

Whois information should be attached to a domain name object upon initialization. An attempt should also be made to authoritatively resolve the domain name to an IP address, and if successful, the corresponding IP whois information and chain of DNS servers involved in the resolution should be attached to the domain name object. Lastly, if a domain name is successfully resolved into an IP address, a quick attempt should be made to see if a TLS/SSL service exists on that address, and if one does exist, the TLS/SSL certificate should be attached to the initialized domain name object as well.

Canonicalization

[ Back to Table of Contents ]

To canonicalize domain names:

  1. Strip leading and trailing spaces
  2. Convert to lower case

The verification of the entire domain name and each zone string should also be performed. Zones can pretty much be described as the “nodes” of a domain name. For example, in the case of the domain name ᴡᴡᴡ.coca-cola.com, its zones would be ᴡᴡᴡ, coca-cola, and com.

Valid zone names typically comply with the following rules:

  1. Must not contain spaces
  2. Must not start with a dash
  3. Must be between 1 to 63 characters

Meanwhile, valid domains names typically comply with the following rules:

  1. All zone names are valid
  2. Must not exceed 253 characters
  3. The top-level domain (i.e. the right most zone, such as com, net, org, etc.) must be in the list of valid top-level domains

This verification process prevents IP addresses and other invalid values from being treated as domain names. This is important because sometimes various assets are scraped off unstructured documents.

Inference

[ Back to Table of Contents ]

From the two pieces of whois information attached to the domain name object upon initialisation, we may be able to infer the following:

  1. Registrant Names
    1. Whois Information
      1. Registrant Organization
      2. Admin Organization
      3. Tech Organization
    2. TLS/SSL Certificate
      1. Subject Organization
  2. Domain Names
    1. Parent Domain Name
    2. Whois Name Servers
    3. DNS Resolution Name Servers
    4. TLS/SSL Certificate Subject Common Name
  3. Email Addresses
    1. Whois
      1. Registrant Email
      2. Admin Email
      3. Tech Email
  4. Network Blocks
    1. Network Block of corresponding IP address

Transforms

[ Back to Table of Contents ]

To discover assets related to a domain name:

  1. Google Transform
    • Sources
      • Google
    • Inputs
      • site:[domain-name]
      • "@[domain-name]"
    • Outputs
      • Domain Names
      • Email Addresses
      • Social Media Accounts
  2. Shodan Transform
    • Sources
      • Shodan
    • Inputs
      • hostname:[domain-name]
    • Outputs
      • Registrant Names
      • Domain Names

Classification

[ Back to Table of Contents ]

The classification inheritance mechanism of domain name objects is probably the most complicated out of all asset types. As such, I will be breaking the process down into subsections in order to make it more understandable.

Manual Classification

[ Back to Table of Contents ]

If the domain name object has been manually classified by the user, it inherits the classification label that the user has assigned to it and then stops immediately.

Whois Registrant Name

[ Back to Table of Contents ]

If a registrant name exists for a domain name and it has a definitive classification label (either “rejected” or “accepted”, and not “unclassified”), then the domain name inherits that classification label and stops immediately.

A domain’s registrant name can be found in the “registrant organization” attribute of the whois information attached to it.

This classification inheritance rule is great for sorting out “organization-level” domain names (i.e. those that are directly enrolled to domain name registrars such as coca-cola.com).

Parent Domain Name

[ Back to Table of Contents ]

If no registrant name exists for a domain name, or no definitive classification label exists for the registrant name associated with a domain name, then the classification inheritance process defers to the classification label of the domain name’s parent, if and only if, the parent is classified as “accepted”.

We add the condition indicated at the end because sometimes domains belonging to a different organization can contain domains that belong to the target organization. For example: the com domain belongs to Verisign, but the coca-cola.com domain belongs to the target. The fact that com is rejected should not imply that coca-cola.com is rejected as well. However, if a domain is accepted, it should imply that all of its subdomains are accepted. This assumption, however, does not work well with organizations that offer IaaS/PaaS services and domain name registrars such as Verisign because that would imply that all domain names under com is owned by Verisign. This is one of the limitations that I have identified with the Inquisitor approach, and I’ve included this note in the “limitations” section of this article.

At this point, you might be asking why we inherit classification labels of parent domain names but not parent network blocks.

The answer is simple: the point of having a hierarchical domain name system is to have the ability to hierarchically organize the network assets of organizations. The network assets that fall under a specific domain generally belongs to the most immediate organization owning that domain because that’s how domains are meant to work in the first place.

In the case of network blocks, public IP address ranges are typically divided so that they can be allocated to another organization, hence, it would make sense that the owner of a parent IP block would usually be an organization distinct from the owner of a child IP block. An organization’s network assets can be hierarchically organized too using network blocks though, but that usually happens within private IP address ranges.

This classification inheritance rule is great for sorting out “service-level” domain names (i.e. those that fall under organization-level domain names such as ᴡᴡᴡ.coca-cola.com or mail.coca-cola.com).

Authoritative Name Server

[ Back to Table of Contents ]

If the parent domain is not classified as accepted, we defer to the classification label of the authoritative name server that has resolved the domain name into an IP address, if and only if, the authoritative name server is classified as “accepted”.

This classification inheritance rule is great for sorting out organization-level domain names that do not fall under an accepted domain name, and aren’t properly registered with an appropriate registrant name or are registered with a registrant name that hasn’t been manually classified as “accepted” yet.

An example of this would be the coca-cola.com.br domain. Notice that it’s registered under the name RECOFARMA IND. DO AMAZONAS LTDA and it doesn’t fall under an accepted domain name, yet the authoritative DNS server that has resolved it is ns3.ko.com which is a DNS server owned by THE COCA-COLA COMPANY.

This inheritance rule will only work if a successful DNS resolution could be made though. It also doesn’t work well with organizations offering domain name registration and IaaS/PaaS services because such organizations occasionally own the authoritative DNS servers which resolve names owned by other entities, hence, introducing false positives in the classification inheritance process.

TLS/SSL Certificate Subject Organization

[ Back to Table of Contents ]

If the authoritative name server is not classified as accepted, we defer to the classification label of the subject organization name (which we will treat as a registrant name) indicated in the TLS/SSL certificate associated with the domain name, if and only if, the classification label in question is “accepted” and it is verified that the certificate is for the host that the domain name resolves to.

To verify if a provided certificate belongs to a domain, you check the DNS resolution chain for CNAMEs and compile all CNAMEs into a list and then add the original domain name to that, and then see if the certificate’s common name match any of the domains in the list.

This inheritance rule will only work if a TLS/SSL service is running on the host that a domain name resolves to.

Parent

[ Back to Table of Contents ]

A domain name can fall under other domain names that are owned by the target. Usually, inferring the parent of a domain name is as simple as removing its leftmost zone label (e.g. the parent of ᴡᴡᴡ.coca-cola.com is coca-cola.com). If the immediate parent of a domain name does not belong to the target however, it defers to the most immediate owned network block where its resolved IP falls under. If there’s no qualified network block or if the domain name cannot be translated into an IP address, then it falls under the registrant name associated with it given that the registrant name is owned by the target. If there is no qualified registrant name, then the domain name becomes a top-level object.

Acceptability

[ Back to Table of Contents ]

The computation of the acceptability rating of an unclassified domain name involves stripping off registry-level zone labels (e.g. com, co.uk, etc.) and then calculating the maximum Needleman–Wunsch similarity score between all accepted domain names and the unclassified domain name being rated.

Here’s a sample of the approach in action using the Needleman–Wunsch GitHub project we used earlier:

>>> alignment.needle('coca-cola', 'coca-colaproductfacts')
Identity = 42.857 percent
Score = 30
coca-col---------a---
coca-col a 
coca-colaproductfacts
>>> alignment.needle('coca-cola', 'pepsico')
Identity = 22.222 percent
Score = -15
coca-cola
 co 
pepsico--
>>>

Post-Footprinting

[ Back to Table of Contents ]

Domain names may be subjected to further discovery efforts by employing various scanning techniques. These techniques go beyond the scope of footprinting and are therefore not included in Inquisitor’s function. Some of the scanning techniques that may be employed on domain names include two processes that I call Service Name Enumeration and Locality Enumeration.

After these scanning steps, domain names may also be subjected to port scans, version scans, and vulnerability assessments in order to find potential entry points into the target organization’s internal network.

Lastly, you can also search for password dumps containing emails under a certain domain name on sites like PasteBin.

Service Name Enumeration

[ Back to Table of Contents ]

Basically, this process involves detecting service names in a domain name and swapping that out with another service name and checking to see if we can resolve the resulting domain name into an IP address.

Let’s take for example that we have the following list of “service names” that we will be using for this process:

  • ftp
  • smtp
  • mail
  • ᴡᴡᴡ
  • ns1
  • ns2

Now, our input for this example shall be the ᴡᴡᴡ.coca-cola.com domain.

Observe the domain name being evaluated. We can see that it contains a service name that is in our list, namely, ᴡᴡᴡ.

Now, as described in the process above, we replace this with the other service names in our list and attempt to resolve them into IP addresses.

  • ftp.coca-cola.com => [no-response]
  • smtp.coca-cola.com => [no-response]
  • mail.coca-cola.com => 94.245.120.86
  • ns1.coca-cola.com => [no-response]
  • ns2.coca-cola.com => [no-response]

And voila! We found a new host, which is mail.coca-cola.com.

Here’s a neat list of service names from Daniel Miessler’s SecLists project if you need a more comprehensive one.

Locality Enumeration

[ Back to Table of Contents ]

This process is essentially the same as service name enumeration except we attempt to find labels indicating a locality in domain names and then swap them out with other locality labels and then testing if they exist by resolving them into IP addresses.

Let’s take for example that we have the following list of “locality labels” that we will be using for this process:

  • co
  • ph
  • nl
  • ch
  • ru

Now, our input for this example shall be the ᴡᴡᴡ.coca-cola.com.ph domain.

Observe the domain name being evaluated. We can see that it contains a locality label that is in our list, namely, ph.

Now, as described in the process above, we replace this with other locality labels in our list and attempt to resolve them into IP addresses.

  • ᴡᴡᴡ.coca-cola.com.co => 184.30.202.52
  • ᴡᴡᴡ.coca-cola.com.nl => [no-response]
  • ᴡᴡᴡ.coca-cola.com.ch => [no-response]
  • ᴡᴡᴡ.coca-cola.com.ru => 194.85.61.76

And now we end up with a couple of new hosts: ᴡᴡᴡ.coca-cola.com.co and ᴡᴡᴡ.coca-cola.com.ru.

Here’s a list of of locality labels maintained by Mozilla. As you’ve noticed, “locality labels” do coincide with top-level domains and “registry-level” domains (we’ll discuss that later).

Email Addresses

[ Back to Table of Contents ]

Email addresses are basically used to identify subjects that can send or receive emails. They’re composed of an identifier and a domain name (i.e. identifier@domain.com). I think we’re all pretty familiar with how this works. Anyway, once you get further into the penetration testing process, you’ll basically want to take all of the email addresses that you have been able to gather and then perform a spear-phishing campaign on them as part of the “gaining access” part of penetration testing.

Canonicalization

[ Back to Table of Contents ]

To canonicalize email addresses:

  1. Strip leading and trailing spaces
  2. Lower-case the domain portion of the email address

The verification of the email address’ syntax should also be performed.

Inference

[ Back to Table of Contents ]

The only information that you can infer from an email address is its domain name.

Transforms

[ Back to Table of Contents ]

To discover assets related to an email address:

  1. Google Transform
    1. Sources
      1. Google
    2. Inputs
      1. "[email-address]"
    3. Outputs
      1. Domain Names
      2. Email Addresses
      3. Social Media Accounts

Classification

[ Back to Table of Contents ]

If the email address has not been manually classified as belonging to the target organization, then it defers to the classification label of the domain name associated with it.

Parent

[ Back to Table of Contents ]

An email address can only fall under a domain name owned by the target, preferably the most immediate one. If this condition cannot be satisfied then the email address becomes a top-level object.

Acceptability

[ Back to Table of Contents ]

Computing the acceptability rating of email address involves inheriting the acceptability rating of the domain name associated with it.

Post-Footprinting

[ Back to Table of Contents ]

The primary post-footprinting purpose of email addresses are to be phishing targets. Aside from that, they may also be used to check for compromised credentials. You can go to sites like HaveIBeenPwned to check if an email address has been compromised in a data breach. You can also search for password dumps that a certain email address has been part of on sites like PasteBin.

Social Media Accounts

[ Back to Table of Contents ]

Accounts from various social media websites such as Facebook, Twitter, and LinkedIn typically fall under the distinction of “social media account”. While, in implementation, different semantic rules, hence, different canonicalization, inference, transform, and classification techniques, would apply to different each type of social media accounts, there are too many of them for me to discuss, so I believe it would be more practical, for the sake of this article, if I generalize them in this section.

Upon initialization of a social media account object (whether it be for LinkedIn or whatever), it would be great if you can manage to scrape off some information about the account from the website or the social media service’s API (if they have one).

One useful method for extracting additional information about a social media account would be to use the Google Search API. It can provide you structured information about a social media account. In the case of LinkedIn, Google Search API can provide you a LinkedIn account’s organization name, role name, location, and other information that you can use in transforms and inferences later on.

Canonicalization

[ Back to Table of Contents ]

Given that each type of social media account follows its own unique semantics, different canonicalization rules would have to apply to each of them as well, but generally, you’d want to use anything unique about each account as their identifier. These unique account identifiers may either be usernames, GUIDs, account numbers, email addresses, etc., and they may follow some unique set of semantics such as being case-insensitive, having a specific length, following a certain regular expression, etc.

Generally, you’d want to take note of the following:

  1. What sort of data does the social media service use to identify accounts?
    • Usernames?
    • Globally Unique Identifiers?
    • Account Numbers?
    • Email Addresses?
  2. What sort of rules do the identifiers follow?
    • Case-sensitivity rules?
      • You should upper-case or lower-case the identifiers if this is the case
    • Length rules?
    • Regular expressions rules?

Let’s take LinkedIn for example. LinkedIn uses case-insensitive usernames to identify users. In cases like these, you’d want to implement the following canonicalisation procedure:

  1. Strip leading and trailing spaces
  2. Convert to lower case

You’d also want to implement a verification mechanism during canonicalisation. It could be as simple as accessing the user’s account by visiting their profile at https://ᴡᴡᴡ.linkedin.com/in/​, and verifying the response of the website. Some social media service websites might return a 404 or some unique message indicating that the account does not exist. In the case of LinkedIn, you’d be redirected to a specific URL https://ᴡᴡᴡ.linkedin.com/in/unavailable, so in this case, you’d want to watch out for a 302 with the redirect URL above.

Inference

[ Back to Table of Contents ]

As with canonicalization, different inference procedures will apply to different types of social media accounts as well. The assets that you would be able to retrieve in the inference process would depend on the information that you were able to scrape from the social media service’s website, API, or Google Search results.

Some of the possible things that you might have retrieved may include:

  1. Email addresses
  2. Hyperlinks
  3. Links to other social media accounts
  4. Organization names

As such, we may be able to infer the following information:

  1. Registrant Names
    1. Retrieved organization names
  2. Domain Names
    1. Retrieved hyperlinks
  3. Email Addresses
    1. Retrieved email addresses
  4. Social Media Accounts
    1. Retrieved links to other social media account

Transforms

[ Back to Table of Contents ]

If a social media object is identified by a username or has a username associated with it, it could be used as a transformation input since users generally tend to use the same username across different platforms, especially if they’re trying to cultivate an online presence. However, using GUIDs or account numbers as transformation inputs simply makes no sense as they’re not guaranteed to make sense outside the context of the service that they exist in.

Anyway, to discover assets related to social media accounts:

  1. Google Transform
    • Source
      • Google
    • Inputs
      • "[social-media-account.username]"
    • Outputs
      • Domain Names
      • Email Addresses
      • Social Media Accounts

Classification

[ Back to Table of Contents ]

Automatic classification can be a really tricky business in the context of social media accounts because of how varied they are (or maybe it’s because I’m trying to generalize a really large and diverse set of object types here). However, there MAY be some cases where automatic classification can be performed.

Social media services such as LinkedIn associate accounts with specific organization names. In these cases, it may be appropriate to treat the organization name as a registrant name and inherit its classification label. The social media service should provide verification of the organization name however, so that not anybody can just claim that they’re organization X. This verification requirement is a given when it comes to internet registries so that’s why we can trust them, but social media does not necessarily verify its registered users and organizations, so we can’t trust them to give us an honest answer off the bat.

If a social media account has an email address associated with it and the service requires the completion of an email address validation procedure prior to allowing users to use the service, the social media account can inherit the classification label of the email address in question. There may be instances where social media accounts registered with email addresses belonging to the target organization are being used for personal use, but in cases such as these, I think it’s still fair game for penetration testers to conduct spear phishing on the owners of these accounts as they are probably associated with the target organization as well. I mean, how would you get a @coca-cola.com email if you don’t work at The Coca-Cola Company? I’d understand if it was an @gmail.com email though.

Parent

[ Back to Table of Contents ]

Unless you keep track of other social media artifacts (groups, pages, etc.), social media accounts usually fall under email addresses or directly under registrant names. The conditions for establishing parental relationships should pretty much be the same as the conditions for inheriting classification labels.

Acceptability

[ Back to Table of Contents ]

A method of determining acceptability may exist for some types of social media accounts. It’ll likely be computed using the Needleman–Wunsch algorithm again using the list of accepted registrant names and a Facebook/LinkedIn/Twitter account’s display name.

Post-Footprinting

[ Back to Table of Contents ]

The post-footprinting applications of social media accounts are similar to those of email addresses. They are to serve as phishing targets, and the usernames associated with them may be checked on sites like HaveIBeenPwned and PasteBin.

Inquisitor Workflow

[ Back to Table of Contents ]

Now that we got the fundamental concepts down, it’s time to put them all together in a single coherent process intended to (ideally) yield you the most comprehensive and accurate footprinting results possible.

Seeding the Intelligence Database

[ Back to Table of Contents ]

In the beginning, there was nothing; and as our homeboi, Parmenides, once pointed out:

ex nihilo nihil fit

Yeah. So basically, if the intelligence database is empty, you won’t really get anywhere as there’s really nothing to apply transforms or inferences to. Without transforms and inferences, you’re pretty much stuck because those are the only two types of automated information gathering mechanisms that Inquisitor is aware of.

Inquisitor can’t just guess which organization you’re targeting. You have to give it initial information. You have to give it “axioms” or “first truths” to go off from. You have to give it a clue on what sort of assets that your target organisation owns. This can be as simple as declaring that the site coca-cola.com belongs to your target organisation, or that the organisation that you’re targeting registers its domains and network blocks under the name The Coca-Cola Company, etc.

You should also look for a list of acquisitions and mergers performed by the company that you’re targeting in order to identify any other websites, or registrant names that you need to include in the seeding phase.

This initial information allows Inquisitor to know what kind of transformation methods to run, what classification labels can be inherited, etc.

Running Transforms

[ Back to Table of Contents ]

Now that we have some initial information from the seeding phase, we can start the process of actually discovering our target’s assets by probing various OSINT sources. This involves running the transformation methods of the asset objects that we provided during the seeding phase.

The mere assertion that coca-cola.com belongs to the target already gives Inquisitor plenty of information to go off from. From this alone, we can determine which OSINT sources we need to probe, what kind of questions we should be asking the source, and what constitutes asset ownership. For example, given that we know coca-cola.com belongs to our target, it may be a good idea to ask Google for domains that fall under coca-cola.com. After that, all domains found falling under coca-cola.com will be automatically classified as “accepted”.

Auditing and Verifying Transform Results

[ Back to Table of Contents ]

After transformation, you’d likely end up with a bunch of unclassified assets inside your intelligence database. These are the assets that Inquisitor has failed to automatically classify, and as such, you are going to have to perform manual classification on them yourself.

I know that going through a huge list of assets and classifying them one-by-one sounds like quite an ordeal, but there is a method to this, and it involves leveraging on the how the classification inheritance mechanism works.

I will explain in the following sections.

Verifying Registrant Names

[ Back to Table of Contents ]

You are going to want to manually inspect and classify registrant name objects first.

Aside from commonly being fewer in number than the rest of the asset types (hence, being less strenuous to classify), the classification inheritance function for most asset types point to the classification label of a registrant name. This means that by manually classifying a registrant name as “owned”, you would effectively be classifying all the domain names and network blocks registered under that name as “owned” as well. This effectively reduces the number of assets that we would need to manually review later on by a lot.

Generally, you’d want to take note of the following steps when reviewing registrant names. For the purpose of this section, let’s say we’re doing footprinting work for Coca-Cola.

Verify Promising Registrant Names

[ Back to Table of Contents ]

What’s that? You have an unclassified Registrant Name in your Intelligence Database labelled The Coca-Cola Company? Hmm. I wonder if this name is in any way related to your target company, which incidentally, is called “Coca-Cola”.

Yeah. You see, during the verification process, you’re going to have to engage in a variety of mentally laborious tasks such as determining whether the name The Coca-Cola Company actually refers to Coca-Cola.

If you think it does, then go ahead and classify that registrant name as “owned”.

However, aside from handling cases with blatantly obvious answers, there are some instances where a registrant name is indeed owned by the organization you’re doing footprinting for even though the name might look like it belongs to a different company such as in the case of registrant name Odwalla Inc..

In this case, the relationship with Coca-Cola isn’t very clear, but upon checking the list of acquisitions of Coca-Cola on Wikipedia, we can see that Odwalla was actually acquired by Coca-Cola in the past, hence, owned by it, meaning that it should fall under our scope as well.

So, yeah. When verifying registrant names, make sure to check if they were part of an acquisition or merger conducted by the company you’re doing footprinting for.

Reject Infrastructure Provider Registrant Names

[ Back to Table of Contents ]

After running transforms, you might end up with some registrant names belonging to various infrastructure providers such as Verisign or Amazon. This will occur because some of the IP addresses and domain names utilized by the websites of your target may fall under the network block of an IaaS provider and the domain of a TLD maintainer.

Make sure to reject these registrant names outright as they definitely have nothing to do with the organization you’re doing footprinting for, unless you’re actually doing footprinting for those organizations (which you shouldn’t be). Rejecting these registrant names removes them from the list of assets that you have yet to review, hence, allowing you to focus your attention on reviewing other unreviewed assets instead.

Verifying Domain Names

[ Back to Table of Contents ]

After you’re done manually inspecting and classifying registrant names, you’re going to want to look at manually inspecting and classifying domain names next.

I am well aware that network blocks usually come next in the de facto hierarchy of asset types. However, no other asset type inherits classification labels from network blocks as of the time this article was written, so we’re better off moving on to domain names for now as plenty of other asset types inherit their classification labels from them.

Generally, you’d want to take note of the following steps when reviewing domain names. For the purpose of this section, let’s say we’re doing footprinting work for Coca-Cola again.

Verify Promising Organization-Level Domain Names

[ Back to Table of Contents ]

Hmmm. Does the domain name coca-cola.com sound like it belongs to a company called Coca-Cola? Perhaps.

I know you hate having to manually classify assets, but in the end, Inquisitor just attempts to derive logical conclusions from the assumptions that you give it, so it can only minimize so much of your work. It can’t assume things for you, though I might build a feature like that sometime in the future.

Anyway, when manually inspecting and classifying domain names, you’d want to pay attention to organization-level domain names first.

What are organization-level domain names you ask? Well, kindly take a look the following list of domain names:

  • com
  • coca-cola.com
  • ᴡᴡᴡ.coca-cola.com

The first one (com) is what I’d call a registry-level domain name. Incidentally, it’s also a top-level domain name. By definition, these domain names belong to domain name registrars and domains for various organizations typically fall directly under them.

The difference between a registry-level domain name and a top-level domain name is that a top-level domain name does not have a parent domain while a registry-level domain name may have a parent in some instances. A lot of countries like the United Kingdom have their registry-level domains in the second-level (e.g. co.uk), therefore making my work here much, much harder.

Essentially, all top-level domain names are registry-level domain names, but not all registry-level domain names are top-level domain names. You can find a list of top-level domain names in this list maintained by IANA. Meanwhile, Mozilla keeps a list of registry-level domain names over here.

The next one (coca-cola.com) is what I’d call an organization-level domain name. By definition, these domain names are the domains registered by organizations to domain name registrars, and registrant organizations typically place their named assets (such as FTP and name servers) under domain names of this level.

We want to consider domains of this level first because by classifying them, we’d effectively be classifying the subdomains, hostnames, and email addresses under them as well, hence, saving us a lot of work later on.

The last one (ᴡᴡᴡ.coca-cola.com) is what I’d call a service-level domain name. By definition, these domain names address specific computing assets in the organization such as mail servers, name servers, file servers, web servers, or any other computing service provided by the organization over a network.

Domain names of this level typically fall under an organization-level domain name though the two levels are not mutually exclusive. Some organization-level domain names can address specific network services as well, hence, effectively becoming service-level domain names themselves.

Note that organizations may register multiple organization-level domain names for their organization so you’re not going to be looking for just one domain name to manually classify.

Reject Infrastructure Provider Domain Names

[ Back to Table of Contents ]

Just like in the case of ending up with infrastructure provider registrant names as discussed earlier, you’d likely end up with infrastructure provider domain names as well after running transformation functions. And just like what you did with infrastructure provider registrant names, you’d likely want to reject infrastructure provider domain names as well.

As of the time that this article was written, Inquisitor does not provide a mechanism for cascading rejection classification labels down to subdomains and other related assets, but the concept is currently in the road map for future developments. The best you could do for now is to create a rule which rejects all subdomains under a specific domain name.

(Not) Verifying IP Blocks and Email Addresses

[ Back to Table of Contents ]

After verifying registrant names and domain names, you’re probably thinking of verifying IP blocks and email addresses next. However, I’d recommend that this process be skipped altogether because aside from the fact that IP blocks and email addresses already inherit the classification labels of their corresponding parents, no other asset types inherit classification labels from either.

Additionally, it’s quite unlikely that you’d find a target-owned email address that falls under an un-owned domain, nor a target-owned IP block that is registered under an un-owned registrant name.

Verify Social Media Accounts

[ Back to Table of Contents ]

After verifying registrant names and domain names, you’d want to take a look at social media accounts next. Though no other asset type inherits their classification label from social media objects as of the time that this piece was written, they are directly part of your target organization’s attack surface so it would be wise to consider them as well.

You might be wondering why I’m asking you to manually verify social media objects but not IP blocks or email addresses. That’s because some social media objects may not exactly have a mechanism of inheriting classification labels from other assets, and thus, they need to be manually inspected.

Running Transforms Again (and Again)

[ Back to Table of Contents ]

After performing manual verification and classification of assets in the previous step, the model of your target organization (as represented by the contents of the intelligence database) will have effectively changed. New assets would probably have been added to the list of assets belonging to your target, hence, transforms that have never been executed before would be available for execution. These new transforms may potentially yield new assets that have not been discovered before.

Basically, you’d want to continue running transforms and performing manual verification over and over again until you encounter a phase where running transforms yield you no new assets. Once this happens, your footprinting process would essentially be over and it would be time for you to wrap up your results by generating reports.

Generating Reports

[ Back to Table of Contents ]

Inquisitor provides you a facility to export the contents of your intelligence database as well as generate a visualization of it. This will give you a map of your target’s public-facing assets in the later phases of penetration testing.

Leveraging OSINT Results

[ Back to Table of Contents ]

Now that you have a map of your target’s public-facing assets, it’s time you actually made use of the information contained within in order to assist you in the later phases of penetration testing.

Each asset can be leveraged in a different manner. For example, IP blocks can provide you a starting point for future ping sweeps and port scans, email addresses and social media accounts can provide you a starting point for future spear phishing attacks, domain names can provide you a starting point for finding vulnerable web applications, etc.

I’ve laid out the possible post-footprinting steps that you can take for each asset type in the “types of assets” section earlier in this article. You can refer to them once you’ve run out of ideas on what to do with your footprinting results.

Limitations

[ Back to Table of Contents ]

As of the time that this article was written, I have identified a couple of limitations that users need to watch out for in order to make sure that they get the best and most accurate results.

Wrong Assumptions, Wrong Results

[ Back to Table of Contents ]

The correctness of automatic classification results fundamentally lie with the truthfulness of the manual classifications you provide the system either through the seeding phase or the auditing and verification phase. That means, if you feed your intelligence database incorrect information, the system will inevitably generate incorrect results as well.

Bad With Certain Organization Types

[ Back to Table of Contents ]

As I mentioned a few times throughout this article, the Inquisitor footprinting approach doesn’t work well with the following organization types:

  1. PaaS/IaaS Providers
  2. Domain Name Registrars

The reasons why Inquisitor is particularly bad with these organization-types is because these organization types frequently defy classification inheritance assumptions, especially surrounding domain names.

For example, these organizations may delegate their subdomains to other entities, hence, breaking the assumption that all subdomains under an organization-owned domain is owned by that organization. Also, these organizations may resolve domain names for other organizations, hence, breaking the assumption that an organization owns a domain name if a DNS server under its control authoritatively resolves it.

Future Developments

[ Back to Table of Contents ]

For the future, I’d probably want to add more OSINT sources (e.g. Netcraft), track more asset types (e.g. documents and more social media accounts), and implement rules-based manual classification.

picoCTF 2017 Write Up

Hello there, ladies and gentlemen. It seems I haven’t written anything on this blog for a while. For those of you who think I’m already dead: I’m not. I’m still here – alive and kicking!

Anyway, today, I’ll be posting my write-up for picoCTF 2017 which closed this last April 14. I didn’t get to work on it as much as I’d like to because I was on a vacation trip in Japan for the most of the month but I did finish a handful of challenges in the little time I got to spend on it.

The website for the picoCTF game is located at https://2017game.picoctf.com/. It looks like the challenges are still solvable at the time that this write-up was written so I may add some additional sections into this article later on after the initial revision.

This CTF is targeted towards high school and college students so the challenges were quite easy as compared to your typical “Crazy Russian Hacker” CTFs. It features a story line presented in a Visual Novel form which is pretty neat and quite unique considering all the CTFs I’ve played thus far. However, I have yet to finish the story at the time that this article was written and I seem to have forgotten some parts of the story already because of how long my vacation was.

I wasn’t alone in trying to solve this CTF though. I was with a team called “HydraSec” and this team is pretty much composed of contributors to a blog I (have yet to) contribute to. The blog is called InfoSecPinas. And with that in mind, do note that I’ll only be writing about the challenges that I have solved and I’ll be skipping the challenges that were solved by someone else.

Alright then, hajimemashou!

Table of Contents

Level 1 – Binary Exploitation 40 (Bash Loop)

[ Back to Table of Contents ]

ctf-picoctf2017-lv1-bin40

OK. So a program that’s thinking of a number. Essentially, it’s a program that generates a random number and then compares the user input to that number before showing the flag.

ctf-picoctf2017-lv1-bin40-fig1.PNG

The approach is pretty straightforward: feed the program every number from 0 to 4096. In order to do this, simply paste the following line of bash code into the terminal and execute it:

cd /problems/e3f1970eb419b3aa32788a43ec91ef08; for i in {0..4096}; do RESULT=`./bashloop "$i"`; if [[ "$RESULT" =~ ^Nope.* ]]; then :; else echo "$RESULT"; break; fi; done;

That should do the trick. After you do this, you’ll be greeted with the following message:

ctf-picoctf2017-lv1-bin40-fig2

The flag is: 9960332950d7db392f97f29dee04f4ee if it wasn’t so blatantly obvious.

Now let’s move on to the next problem.

Level 1 – Binary Exploitation 40 (Just No)

[ Back to Table of Contents ]

ctf-picoctf2017-lv1-bin40-2

Wow. What a selfish program! Looks like this little fella is going to need some “convincing”. Alright, let’s begin. Navigate to the directory that the instruction is pointing at and look for that selfish program.

ctf-picoctf2017-lv1-bin40-2-fig1

Now let’s fire it up.

ctf-picoctf2017-lv1-bin40-2-fig2

Wow. K then fam. Looks like this isn’t going to be as straightforward as I thought. It mentioned something about an auth file so I’m guessing it’s the one named auth from when we did an ls in the directory (no shit Sherlock).

Let’s see what it says:

ctf-picoctf2017-lv1-bin40-2-fig3

Well. There’s our problem. Let’s make that a “yes”.

Obligatory JoJoke. Moving on…

ctf-picoctf2017-lv1-bin40-2-fig4

Well. That didn’t work out. Maybe this is where the provided source code for the binary comes in. Let’s take a look:

ctf-picoctf2017-lv1-bin40-2-fig5

Now take a look at the highlighted section above. That’s our key to convincing the program to show us the flag.

What seems to be happening here is that the program looks for the auth file in a location that is relative to the current context of execution. It doesn’t look for it in an absolute location, so if we could move the current context of execution to a location where an auth file containing “yes” is accessible using the relative path ../../problems/7e8b456c98db60be9a33339ab4509fca/auth, then we could convince the program to show us the flag.

In order to do this, we first need to create an mock location containing an auth file with a “yes” inside. We’re going to have to do this under a directory that our current user account can control. I’d go with the $HOME directory because that usually belongs to the current user.

ctf-picoctf2017-lv1-bin40-2-fig6

After creating the mock location, we now have to put the auth file with a “yes” inside of it inside our mock location.

ctf-picoctf2017-lv1-bin40-2-fig7

Now here’s the important part. We have to move the current execution context in a location where the path ../../problems/7e8b456c98db60be9a33339ab4509fca/auth is accessible. Going inside the ~/problems/7e8b456c98db60be9a33339ab4509fca directory should do the trick.

ctf-picoctf2017-lv1-bin40-2-fig8

Alright. So now, if we try to access the path ../../problems/7e8b456c98db60be9a33339ab4509fca/auth from the current context of execution, we’re going to get a “yes”.

ctf-picoctf2017-lv1-bin40-2-fig9

Now that we have that in place. It’s just a matter of executing the program.

ctf-picoctf2017-lv1-bin40-2-fig10

As you can see above, the flag is ddf649b13d560409d2649dc06f2a43ee.

If you don’t wanna be bothered to read the step-by-step instructions I have laid out above, simply paste the following code into the terminal and you would have solved this problem instantly.

mkdir -p ~/problems/7e8b456c98db60be9a33339ab4509fca; echo 'yes' > ~/problems/7e8b456c98db60be9a33339ab4509fca/auth; cd ~/problems/7e8b456c98db60be9a33339ab4509fca/; /problems/7e8b456c98db60be9a33339ab4509fca/justno;

Level 1 – Reverse Engineering 20 (Hex2Raw)

[ Back to Table of Contents ]

ctf-picoctf2017-lv1-rev20

OK. First reverse engineering problem of the game. Let’s do as it says and go CLI ourselves to the indicated location.

ctf-picoctf2017-lv1-rev20-fig1

Let’s try running the program shall we.

ctf-picoctf2017-lv1-rev20-fig2

Alright. I think I get what it’s getting at. It’s asking us to provide the “raw” form of the hex string that it’s providing us. For example, if you see a 41 in the provided hex string, you have to give it an “A” at that location because the “A” character is represented by the hex value 0x41.

The tricky part of this problem is that some of the values that it’s asking us to provide cannot by typed out on the keyboard. Luckily, there’s this neat little command line tool called echo which has a -e mode. This mode allows us to convert hex strings into to their raw character form.

For example, if we do an echo -e '\x41', the tool is going to print out an “A”.

ctf-picoctf2017-lv1-rev20-fig3

Now all we have to do is echo out the hex string that the hex2raw program is asking of us and then push it into the hex2raw program. How do we do this you ask? By piping of course! Make sure you escape every hex value in the echo command with \x so they would be treated as hex values to be converted into raw form.

ctf-picoctf2017-lv1-rev20-fig4

And now we have the flag which is: 75d3080d00407fa709c18a6cc69d1edc.

As you noticed earlier, the hex2raw program was asking for keyboard input. This is because keyboard input is the default “Standard Input” of all applications. When you pipe the output (i.e. the printed characters) of one program to another, the output of the previous program in the pipe becomes the “Standard Input” of the program in the later part of the pipe.

Anyway, if you don’t wanna be bothered with following each and every step indicated above, you can simply paste the following code into the terminal and the problem will be solved instantly.

echo -e '\x7c\xa6\x71\x67\xdb\x32\x9a\x5d\x15\x08\xcc\x4a\xd5\x38\x06\x78' | /problems/88518d23aee7ee21e50bdd8414a404c1/hex2raw

Level 1 – Reverse Engineering 20 (Raw2Hex)

[ Back to Table of Contents ]

ctf-picoctf2017-lv1-rev20-2

This one is a pretty straightforward problem. The raw2hex program inside the indicated directory simply prints our flag in raw format, and we are going to have to encode that in hex form in order to determine the flag.

Let’s try running the raw2hex program.

ctf-picoctf2017-lv1-rev20-2-fig1

Well, it seems that our console got messed up by running that program. No problem though. We just have to do an exit and reconnect.

Now this time, let’s pipe the output of the program into xxd​ – a command line tool which encodes raw data into a hex string.

ctf-picoctf2017-lv1-rev20-2-fig2

And there’s our flag basically. I omitted the first twelve hex values because those represent the “The flag is:” portion of the string. We’re only interested in the actual flag in this case.

If you didn’t quite get that, the flag is 71c28db77578a80e38aae0d626d853a5.

Again, if you don’t wanna be bothered to follow the step-by-step instructions that I have laid out above, simply paste the following code into the terminal and you should instantly find the flag that you’re looking for:

cd /problems/40b1e663252261e8203962486523629e; ./raw2hex | xxd;

To be honest, I don’t really see how this or the Hex2Raw problem count as reverse engineering problems.

Level 1 – Cryptography 50 (computeAES)

[ Back to Table of Contents ]

ctf-picoctf2017-lv1-crypto50

While I generally hate cryptography problems, I actually had quite a bit of fun doing this one. Alright, so if you click the “clue” link in the problem description, you will be redirected to a text file containing the following text:

Encrypted with AES in ECB mode. All values base64 encoded
ciphertext = rW4q3swEuIOEy8RTIp/DCMdNPtdYopSRXKSLYnX9NQe8z+LMsZ6Mx/x8pwGwofdZ
key = 6v3TyEgjUcQRnWuIhjdTBA==

OK, so it seems that we are already given both the cipher text and the key in base64 format. It’s just a matter of feeding the hex of the raw values of these base64 strings into an online AES decryptor in order to extract our flag.

So to begin, let’s convert the provided base64 strings into their raw value, and then convert those raw values into hex values that we can then paste into an online AES decryptor. We can do this by grabbing our console and running the following commands:

For hex encoding the cipher text:

echo 'rW4q3swEuIOEy8RTIp/DCMdNPtdYopSRXKSLYnX9NQe8z+LMsZ6Mx/x8pwGwofdZ' | base64 --decode | od -A n -t x1
           ad  6e  2a  de  cc  04  b8  83  84  cb  c4  53  22  9f  c3  08
           c7  4d  3e  d7  58  a2  94  91  5c  a4  8b  62  75  fd  35  07
           bc  cf  e2  cc  b1  9e  8c  c7  fc  7c  a7  01  b0  a1  f7  59

For hex encoding the key:

echo '6v3TyEgjUcQRnWuIhjdTBA==' | base64 --decode | od -A n -t x1
           ea  fd  d3  c8  48  23  51  c4  11  9d  6b  88  86  37  53  04

Alright. Now that we have the hex values for our cipher text and encryption key, let’s feed them to an online AES decryptor in order to extract the flag. For this problem, I used the one provided by OnlineDomainTools (http://aes.online-domain-tools.com/). Make sure the “Function” option is set to “AES” and the “Mode” option is set to “ECB”.

ctf-picoctf2017-lv1-crypto50-fig1

As you can see, the flag is flag{do_not_let_machines_win_983e8a2d}.

Level 1 – Miscellaneous 10 (Piazza)

[ Back to Table of Contents ]

ctf-picoctf2017-lv1-misc10

This one isn’t really a challenge. They just want you to register to their forum. For the sake of completeness, I’ll just write about it as well.

Anyway, just click the link, register for an account, and the flag is in one of the stickies of the forum that you’ll eventually end up in.

ctf-picoctf2017-lv1-misc10-fig1.png

As you can see, the flag is flag{ask_and_hop3fully_we_can_help}.

Level 2 – Reverse Engineering 60 (A Thing Called the Stack)

[ Back to Table of Contents ]

ctf-picoctf2017-lv2-rev60

Upon clicking the provided link, your browser will download a piece of assembly code that you have to trace in order to find the answer to this problem. The assembly code in question is pasted in the following code block:

foo:
    pushl %ebp
    mov %esp, %ebp
    pushl %edi
    pushl %esi
    pushl %ebx
    sub $0x90, %esp
    movl $0x1, (%esp)
    movl $0x2, 0x4(%esp)
    movl $0x3, 0x8(%esp)
    movl $0x4, 0xc(%esp)

In this problem, you’re going to have to find the number of bytes between the location of the return address of the current frame and the top of the stack after the code has finished executing.

Now, let’s remember our registers again.

EBP: Points to the base of the current frame, i.e. the location just below the return address of the current frame.

ESP: Points to the top of the stack.

These are the only relevant ones for this problem. Now let’s get to dissecting the code.

Line 1: foo:

In this line, we enter the foo function. It should really be call foo since that’s the actual instruction used to call the foo function from another function. When we enter a function, the return address gets pushed on top of the stack. When we “push” the return address on top of the stack, we mean the following:

  1. Decrease ESP by 4
  2. Write the address to return to (essentially, the current value of the IP or Instruction Pointer) in the address pointed to by ESP

We decrease because the stack generally grows downwards in most systems. We decrease ESP by 4 because we’re assuming a 32-bit system. In a 32-bit system, all addresses are represented by 32-bits or 4 bytes.

Assuming that the initial value of ESP is 0x00000368, current state of the stack will be as follows after this line has finished executing:

ctf-picoctf2017-lv2-rev60-fig1

These addresses are only hypothetical by the way. It’s there just so we can visualise the problem better.

Line 2: pushl %ebp

In this line, we push the current content of the EBP register on top of the stack. The percent sign in the reference to the EBP register is simply compliance to an assembler requirement where percent symbols have to be prepended to register references.

Anyway, this instruction signifies the beginning of the creation of a new frame. The instruction pushl indicates that we are pushing a DWORD onto the stack. As we recall in our college computer science classes, a WORD is two bytes long, and a DWORD is four bytes long.

So essentially, what happens in this instruction is:

  1. Decrease ESP by 4
  2. Write the contents of EBP to the address pointed to by ESP

After this line has finished executing, the current state of the stack will be as follows:

ctf-picoctf2017-lv2-rev60-fig2

Line 3: mov %esp, %ebp

Now, in order to formalise the creation of the new frame, we need to point our EBP register to the beginning of the frame that we are creating. At the moment, the current value of ESP is pointing to where the new frame should be created, so it is only then appropriate that we assign the current value of ESP to EBP.

The address pointing to the beginning of every frame always contains a reference to beginning of the previous frame (i.e. the value of the previous EBP). This is why we pushed EBP on top of the stack in the previous instruction. Once the previous EBP is saved on top of the stack, we update our EBP to point to the base of our current frame. That is what’s happening in this instruction.

So essentially, what happens in this instruction is:

  1. Move the contents of ESP to EBP

After this line has finished executing, the current state of the stack will be as follows:

ctf-picoctf2017-lv2-rev60-fig3

Lines 4, 5, and 6: pushl %edi; pushl %esi; pushl %ebx;

In lines 4, 5, and 6, we’re simply pushing DWORDS on top of the stack. These are allocations for the local variables of the frame. This would be the equivalent of declaring three integer variables in C.

In case you were wondering, function parameters are stored before the return address and they’re not displayed in the stack diagrams that I have placed in the previous sections. For more information on the stack frame format, kindly refer to this image: https://en.wikipedia.org/wiki/Call_stack#/media/File:Call_stack_layout.svg.

So essentially, what’s happening in these instructions are:

  1. Decrease ESP by 4
  2. Write the current contents of EDI to the address pointed to by ESP
  3. Decrease ESP by 4
  4. Write the current contents of ESI to the address pointed to by ESP
  5. Decrease ESP by 4
  6. Write the current contents of EBX to the address pointed by ESP

After these lines have finished executing, the current state of the stack will be as follows:

ctf-picoctf2017-lv2-rev60-fig4

Line 7: sub $0x90, %esp

In this instruction, we’re allocating an array which can contain up to 144 bytes or 36 DWORDs.  If this was C, this would be the equivalent of declaring int[36] var;. The dollar sign in the 0x90 constant is just compliance to an assembler requirement where constants need to have dollar signs prepended onto them.

Essentially, what’s happening in this instruction is:

  1. Subtract ESP by 0x90 or 144

After this instruction, the current state of the stack will be as follows:

ctf-picoctf2017-lv2-rev60-fig5

Lines 8, 9, 10, and 11: movl $0x1, (%esp); movl $0x2, 0x4(%esp); movl $0x3, 0x8(%esp); movl $0x4, 0xc(%esp);

In these series of instructions, we’re just assigning values into the array that we allocated in the previous instruction.

Essentially, what’s happening in this series of instructions is:

  1. Place the value of 0x1 into the first DWORD slot of the array (offset 0 from ESP)
  2. Place the value of 0x2 into the second DWORD slot of the array (offset 4 from ESP)
  3. Place the value of 0x3 into the third DWORD slot of the array (offset 8 from ESP)
  4. Place the value of 0x4 into the fourth DWORD slot of the array (offset 12 from ESP)

Once this series of instructions have finished executing, the current state of the stack will be as follows:

ctf-picoctf2017-lv2-rev60-fig6

At this point, the program has pretty much finished executing and we can now find the number of bytes between the location of the return address of the current frame and the top of the stack.

Alright, so…

The top of the stack is tracked by the ESP register, so that means the top of the stack is located at 0x000002C4.

The return address of the current frame is located at 0x00000364 if you look at the final stack diagram above.

The number of the bytes between the two locations is 0x00000364 - 0x000002C4 = 0x000000A0.

The problem states that we should remove leading zeroes from the answer, so the answer should be just 0xa0.

Therefore, the flag is 0xa0.

If you want to skip the step-by-step instructions that I’ve laid out above, you can just refer to the annotated code I’ve placed in the code section below:

foo:                        # esp -= 0x4 ; *esp =  ; When function is called, return address gets pushed to the stack (4 bytes)
    pushl %ebp              # esp -= 0x4 ; *esp = %ebp ; Push base pointer of parent frame (4 bytes)
    mov %esp, %ebp          # ebp = esp ; Set base pointer of current frame - that is, the current top of the stack at this point
    pushl %edi              # esp -= 0x4 ; *esp = %edi ; Push integer value in %edi
    pushl %esi              # esp -= 0x4 ; *esp = %esi ; Push integer value in %esi
    pushl %ebx              # esp -= 0x4 ; *esp = %ebx ; Push integer value in %ebx
    sub $0x90, %esp         # esp -= 0x90 ; Move up the stack by 144 bytes (lower address, higher up in stack)
    movl $0x1, (%esp)       # *(esp + 0x00) = 0x00000001 ; Place 0x1 as 1st element in array (length: 4 bytes)
    movl $0x2, 0x4(%esp)    # *(esp + 0x04) = 0x00000002 ; Place 0x2 as 2nd element in array (length: 4 bytes)
    movl $0x3, 0x8(%esp)    # *(esp + 0x08) = 0x00000003 ; Place 0x3 as 3rd element in array (length: 4 bytes)
    movl $0x4, 0xc(%esp)    # *(esp + 0x0c) = 0x00000004 ; Place 0x4 as 4th element in array (length: 4 bytes)

# a = value of esp at the end of the code = ebp - 0x9c
# b = location of the saved return address = ebp + 0x4
# x = b - a
# x = (ebp + 0x4) - (ebp - 0x9c)
# x = (ebp + 0x4) + (-ebp + 0x9c)
# x = 0x4 + 0x9c
# x = 0xa0
# flag = 0xa0

Level 2 – Reverse Engineering 75 (Programmers Assemble)

[ Back to Table of Contents ]

ctf-picoctf2017-lv2-rev75

This problem provides yet another piece of assembly code that we need to trace. For the convenience of both of us, I’ve pasted the assembly code in question in the following code block:

.global main

main:
    mov $XXXXXXX, %eax
    mov $0, %ebx
    mov $0x4, %ecx
loop:
    test %eax, %eax
    jz fin
    add %ecx, %ebx
    dec %eax
    jmp loop
fin:
    cmp $0x6ed0, %ebx
    je good
    mov $0, %eax
    jmp end
good:
    mov $1, %eax
end:
    ret

Alright. So the problem is asking us to determine the appropriate value for the missing constant in line 4 . The required value must allow the function to execute in such a way that it would return a value of 1.

Before we move on, we need to know a couple of fun facts about returning values from functions first.

In the assembly code generated by gcc, the EAX register is typically designated as the “return register”. That means that when the ret instruction is executed after a function finishes executing, the code in the context of the previous frame treats the contents of the EAX register as the return value for the function that was just invoked. That’s why the good section in the piece of code I pasted above moves the value of 1 into the EAX register.

Knowing this, it is then our objective to know the appropriate value of the missing constant so that the good section of the code gets executed.

In order to simplify our plight, I’ve added some pseudo-code annotations into the assembly code that was provided to us. You can find them in the code section below:

.global main

main:
    mov $XXXXXXX, %eax   # eax = 
    mov $0, %ebx         # ebx = 0x0
    mov $0x4, %ecx       # ecx = 0x4
loop:
    test %eax, %eax      # zf = eax == 0
    jz fin               # if zf: goto fin
    add %ecx, %ebx       # ebx += ecx
    dec %eax             # eax -= 1
    jmp loop             # goto loop
fin:
    cmp $0x6ed0, %ebx    # eq = ebx == 0x6ed0
    je good              # if eq: goto good
    mov $0, %eax         # eax = 0
    jmp end              # goto end
good:
    mov $1, %eax         # eax = 1
end:
    ret                  # return eax

Now that makes things a little easier, don’t you agree?

Alright. Now if you look at each of the defined sections in the code, you’ll find that they’re pretty much just doing the following:

  1. The main section simply initialises the registers that we will be using
  2. The loop section repeatedly adds the value of ECX to EBX, EAX number of times
  3. The fin section checks if EBX is equal to 0x6ed0 and returns 1 (via the good section) if it is. If it’s not, it returns 0
  4. The good section does the actual assignment of 1 to EAX before returning

Now we have a better idea of what we need to put in the missing constant.

Looking at the description of the loop and fin sections, we can find out that EAX needs to contain the number of 0x4s (the value of ECX never changes) that need to be accumulated into the EBX register so that the value inside the EBX register would be equal to 0x6ed0.

The equation essentially is: 0x04 * eax = 0x6ed0.

Now let’s do a little algebra:

0x04 * eax = 0x6ed0
eax = 0x6ed0 / 0x04
eax = 0x1bb4

The value of EAX must be 0x1bb4 so that EBX would end up with a value of 0x6ed0. This in turn would allow the function to return a value of 1.

Therefore, our flag is 0x1bb4.

Level 2 – Binary Exploitation 75 (I’ve Got a Secret)

[ Back to Table of Contents ]

ctf-picoctf2017-lv2-bin75

This problem gives you a service that you have to exploit and it also provides you the source code for the service in question as a download. In a nutshell, the service generates a random secret value that you have to guess in order for the service to give you the flag.

For our convenience, I’ve pasted the code that the problem provides in the code section below:

#include 
#include 
#include 
#include 

#define BUF_LEN 64
char buffer[BUF_LEN];

int main(int argc, char** argv) {
    int fd = open("/dev/urandom", O_RDONLY);
    if(fd == -1){
        puts("Open error on /dev/urandom. Contact an admin\n");
        return -1;
    }
    int secret;
    if(read(fd, &secret, sizeof(int)) != sizeof(int)){
        puts("Read error. Contact admin!\n");
        return -1;
    }
    close(fd);
    printf("Give me something to say!\n");
    fflush(stdout);
    fgets(buffer, BUF_LEN, stdin);
    printf(buffer);

    int not_secret;
    printf("Now tell my secret in hex! Secret: ");
    fflush(stdout);
    scanf("%x", &not_secret);
    if(secret == not_secret){
        puts("Wow, you got it!");
        system("cat ./flag.txt");   
    }else{
        puts("As my friend says,\"You get nothing! You lose! Good day, Sir!\"");
    }

    return 0;
}

If you look at line 24 where it says printf(buffer);, you’d immediately know that this program has a format string injection vulnerability. This vulnerability will allow us to pry into the contents of stack memory and extract the value of the secret variable.

Now let’s go over a brief rundown of what a format string injection vulnerability is.

You know those times when you print a variable using printf? Yeah, you use format strings to do that. For example, printf("%s", string);. The first parameter of printf is the format string and the second parameter is the value to be included in the formatted result.

References to values in a format string are indicated by % in printf and each reference to a value is retrieved further down the stack the further it’s ordinal position in the format string is. If you have too many value references in printf and not enough parameters, printf will still look further down the stack for a value it could work with, hence, printing the contents of memory. If we then allow users to arbitrarily enter format strings to print, they can create any number of value references that they wish, which would lead to arbitrary memory reading.

This is what we’re going to leverage in order to acquire the secret value.

So if you look at lines 21 to 24, you’d see that our point of entry is where the service asks the user to “Give me something to say!”. Knowing this, let’s open up a connection to the service and give it what it wants.

ctf-picoctf2017-lv2-bin75-fig1

OK. Let’s try putting in some format strings in there.

ctf-picoctf2017-lv2-bin75-fig2.PNG

Awesome. So what you’re seeing here is the six top-most DWORDs of the stack. This actually contains the value that we want but I will be explaining a bit later on which value it is and why. As you can see above, we can arbitrarily read stack values if we are allowed to inject a format string into printf.

Alright. So in order to retrieve the secret value, we must understand the stack frame format yet again.

By R. S. Shaw – Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=1956587

As you can see, from the top of the stack, you get to access local variables of the current function first, then the return address of the current function, and then the parameters of the current function, and only then would you start encountering the locals of the parent function.

Now. When the program enters printf, it passes the format string and parameters to another function called vprintf which does the actual retrieval and printing of stack values.

ctf-picoctf2017-lv2-bin75-fig3
https://github.com/lattera/glibc/blob/master/stdio-common/printf.c

So, at the moment that vprintf is called, the first %x in the format string that we supplied above actually gave us the value of the done variable which is 0x40. The second %x in the format string gave us the value of the arg variable which is 0xf7fc7c20. And since there are no more local variables, the third %x actually gave us the return address of the printf function which is 0x8048792. Finally, since there is one parameter that was provided to the printf function (this is in the context of the printf called in line 24 by the way) the fourth %x actually gave us the value of the buffer variable which is 0x1. This one actually gave me some confidence about my mind map of the stack because the buffer variable is a global variable and is therefore stored on the data section of the program, and that’s generally on lower memory addresses.

Now since we’ve already went through the locals of printf, it’s return address, and parameters, the next values returned by the format string shall be those of the parent function’s (which is main).

So the main function contains three local variables, namely not_secret, secret, and fd. You already know how this works. The fifth %x of the format string that we provided gave us the value of the not_secret variable which is 0xffffdd34 (it’s uninitialised at this point), and finally, the sixth %x of the format string  gave us the value of the secret variable, which is 0x2a63a11a.

Now all that’s left to do is tell the service it’s secret and we’d have the flag right in our hands.

ctf-picoctf2017-lv2-bin75-fig4.PNG

The flag is a18450ba7aaa8c085c522cdef6ab35ab.

Level 2 – Binary Exploitation 95 (VR Gear Console)

[ Back to Table of Contents ]

ctf-picoctf2017-lv2-bin95.PNG

Alright. For this problem, just cd over to the specified folder in the web console and execute the vrgearconsole program.

ctf-picoctf2017-lv2-bin95-fig1.PNG

Reading the flag.txt file leads to an “Access Denied” message so I’m not going to bother posting the screenshot.

Anyway, it seems that in this challenge, we’re going to have to exploit some input validation problem yet again. Good thing they provided us the source code! I’ve pasted it onto the code section below for our convenience:

#include  
#include  

int login() {
    int accessLevel = 0xff;
    char username[16];
    char password[32];
    printf("Username (max 15 characters): ");
    gets(username);
    printf("Password (max 31 characters): ");
    gets(password);

    if (!strcmp(username, "admin") && !strcmp(password, "{{ create_long_password() }}")) 
{
        accessLevel = 2;
    } else if (!strcmp(username, "root") && !strcmp(password, "{{ create_long_password() 
}}")) {
        accessLevel = 0;
    } else if (!strcmp(username, "artist") && !strcmp(password, "my-password-is-secret"))
 {
        accessLevel = 0x80;
    }

    return accessLevel;
}

int main(int argc, char **argv) {
    setbuf(stdout, NULL);
    printf(
        "+----------------------------------------+\n"
        "|                                        |\n"
        "|                                        |\n"
        "|                                        |\n"
        "|                                        |\n"
        "|  Welcome to the VR gear admin console  |\n"
        "|                                        |\n"
        "|                                        |\n"
        "|                                        |\n"
        "|                                        |\n"                                   
        "+----------------------------------------+\n"                                   
        "|                                        |\n"                                   
        "|      Your account is not recognized    |\n"                                   
        "|                                        |\n"                                   
        "+----------------------------------------+\n"                                   
        "\n\n\n\n"                                                                       
        "Please login to continue...\n\n\n"                                              
    );                                                                                   
    int access = login();                                                                
                                                                                         
    printf("Your access level is: 0x%08x\n", access);                                    
                                                                                         
    if (access >= 0xff || access <= 0) {                                                 
        printf("Login unsuccessful.\n");                                                 
        exit(10);                                                                        
    } else if (access < 0x30) {                                                          
        printf("Admin access granted!\n");                                               
        printf("The flag is in \"flag.txt\".\n");                                        
        system("/bin/sh");                                                               
    } else {                                                                             
        printf("Login successful.\n");                                                   
        printf("You do not have permission to access this resource.\n");                 
        exit(1);                                                                         
    }                                                                                    
}

Alright. Looking at the main function, it seems that the program will give us access to the flag if we can set the access variable to a value that is less than 0x30 but more than 0x00. If we look at where the access variable is set, we can see that it is set to the value returned by the login function. It’s very likely then that our point of entry will lie somewhere in that function.

Now let’s take a look at the login function. It’s very obvious from the get-go that this is a stack-smashing problem since a couple of calls to the gets function is in there.

If you haven’t heard yet, the gets function takes an input from the Standard Input (usually the console) and places it onto the specified buffer (an array in this case) without any sort of boundary checking. This means that if the input is larger than the destination buffer, the input will spill-over into an area beyond the intended destination, usually onto other local variables in the current stack frame. This allows for arbitrary overwriting of local variables if those local variables are in a lower part of the stack relative to the buffer being written onto.

In this case, the accessLevel variable is in a lower part of the stack relative to either the username and password arrays since it was defined first, therefore calls to the gets function eyeing those arrays as destination buffers have the potential of overwriting the accessLevel variable given that we provide the call to gets the appropriate amount of characters via console.

Let’s try inducing a buffer overflow in the username array in this case since it’s closer the the accessLevel variable.

So in order to induce a buffer overflow, we just need to provide 16 characters when we are prompted for a username, and then append an additional four characters to that which will spill over into the accessLevel variable.

The raw value of the concatenation of the additional four characters must have a value that is less than 0x30 but greater than 0, but since the accessLevel variable was already initialised to 0xff, it’s “upper” three bytes have been set to zero already and so we just need to overwrite the portion of accessLevel that was set to 0xff.

If you recall your computer architecture class, most systems operate under a “big endian” scheme where the lower part of an int is stored on a lower memory address and the bigger part of an int is stored on a higher memory address (see https://en.wikipedia.org/wiki/Endianness#/media/File:Big-Endian.svg).  This means that gets will write onto the lower part of accessLevel first, therefore, we just need to append an additional character into the input since just one character is enough to overwrite the part of accessLevel that was set to 0xff. Do note that this one character must have an ASCII value that is less than 0x30. In this case, we will be using the slash (“/”) character whose ASCII value is 0x2f.

In order to obtain access to the flag, we just provide the program an input of aaaaaaaaaaaaaaaa/ for the username, and an empty password, and we’d have privileged access to the system that can read the flag.txt file.

ctf-picoctf2017-lv2-bin95-fig2

The flag is 295aa5afa0b825072f47c9d40b49cc6f.

Level 2 – Cryptography 75 (SoRandom)

[ Back to Table of Contents ]

ctf-picoctf2017-lv2-crypto75

Long story short: connecting to the provided service gives us a string which goes: Unguessably Randomized Flag: BNZQ:2m8807395d9os2156v70qu84sy1w2i6e. It is suggested that this string contains the flag but it has been encrypted in some way.

The provided Python file shows us the algorithm for encrypting the flag. I’ll paste it into the following code section for our convenience:

#!/usr/bin/python -u
import random,string

flag = "FLAG:"+open("flag", "r").read()[:-1]
encflag = ""
random.seed("random")
for c in flag:
  if c.islower():
    #rotate number around alphabet a random amount
    encflag += chr((ord(c)-ord('a')+random.randrange(0,26))%26 + ord('a'))
  elif c.isupper():
    encflag += chr((ord(c)-ord('A')+random.randrange(0,26))%26 + ord('A'))
  elif c.isdigit():
    encflag += chr((ord(c)-ord('0')+random.randrange(0,10))%10 + ord('0'))
  else:
    encflag += c
print "Unguessably Randomized Flag: "+encflag

OK. So looking at the script, it looks like a Caesar Cipher where every character is shifted a random number of times.

Since the random module has been provided a deterministic seed, the random.randrange function is going to give a deterministic sequence of values as well. That means the key for decrypting the first character of the encrypted flag is the result of the first call to random.randrange given that random has been seeded with the string “random” beforehand.

Our algorithm is pretty much going to be the following:

random.seed("random")
target = 'BNZQ:2m8807395d9os2156v70qu84sy1w2i6e'

flag = ''
for t in target:
    k = key(t)
    c = decrypt(t, k)
    flag += c

print flag

The key function returns the appropriate key for the provided character at that index.

def key(t):
    if t.islower():
        return random.randrange(0, 26)
    elif t.isupper():
        return random.randrange(0, 26)
    elif t.isdigit():
        return random.randrange(0, 10)
    return None

And the decrypt function pretty much subtracts the key from the cipher-text as opposed to adding it like what was done during encryption.

def decrypt(t, k):
    if t.islower():
        return chr((ord(t)-ord('a')-k)%26 + ord('a'))
    elif t.isupper():
        return chr((ord(t)-ord('A')-k)%26 + ord('A'))
    elif t.isdigit():
        return chr((ord(t)-ord('0')-k)%10 + ord('0'))
    else:
        return t

The final decryption script is as follows:

#!/usr/bin/python
import random

def key(t):
    if t.islower():
        return random.randrange(0, 26)
    elif t.isupper():
        return random.randrange(0, 26)
    elif t.isdigit():
        return random.randrange(0, 10)
    return None

def decrypt(t, k):
    if t.islower():
        return chr((ord(t)-ord('a')-k)%26 + ord('a'))
    elif t.isupper():
        return chr((ord(t)-ord('A')-k)%26 + ord('A'))
    elif t.isdigit():
        return chr((ord(t)-ord('0')-k)%10 + ord('0'))
    else:
        return t

random.seed("random")
target = 'BNZQ:2m8807395d9os2156v70qu84sy1w2i6e'

flag = ''
for t in target:
    k = key(t)
    c = decrypt(t, k)
    flag += c

print flag

Make sure you run this on the picoCTF shell-web machine because the platform and Python versions also affect the random module. I saved this script in my instance of the shell-web machine and named it sorandomdecrypt.py.

ctf-picoctf2017-lv2-crypto75-fig1

And there we go! The flag is 9b6098160b2ca5139c83fe29fd7c9e5d.

Level 3 – Reverse Engineering 165 (Much Ado About Hacking)

[ Back to Table of Contents ]

ctf-picoctf2017-lv3-rev165

This one is a pretty fun challenge.

Basically the problem hands you a piece of source code written in the Shakespeare Programming Language. It’s one of those weird esoteric programming languages that was probably written as a joke.

Anyway, at first sight, the file looks like a simple text file containing the script for a stage play, but if you thoroughly read the summary of the language in Wikipedia, you’ll get an idea on what sort of rules exist in the language for defining instructions, variables, conditions, sections, etc.

Once we take a closer look at the logic of the program, we’ll find that it actually asks for input and prints output onto the screen. The goal of the challenge is to come up with an input string which will generate the output: tu1|\h+&g\OP7@% :BH7M6m3g=.

For our convenience, I’ve pasted the source code of the program in the code section below:

Much Ado About Hacking.

Benedick, a budding young hacker.
Beatrice, a veteran exploiter.
Don Pedro, a good friend of the others.
Don John, he is just kinda there.
Achilles,  I thought he was from Greece.
Cleopatra, now this is just getting ridiculous.

		   
					  Act I: Also the last act.

				 Scene I: Benedick learns his place.

[Enter Beatrice and Don John]

Beatrice:
You are nothing!

[Exit Don John]
[Enter Don Pedro]

Beatrice:
You are nothing!

[Exit Don Pedro]
[Enter Achilles]

Beatrice: 
You are as proud as a bold brave gentle noble amazing hero. 

[Exit Achilles]
[Enter Cleopatra]

Beatrice:
You are as vile as the difference between a beautiful gentle bold fair peaceful sunny lovely flower and Achilles. 

[Exit Cleopatra]
[Enter Benedick]

Beatrice:
You are nothing!

			  Scene II: Benedick strengthens his memory.

Beatrice: 
Open your mind! Remember yourself.

Benedick:
You are as red as the sum of yourself and a tree. 
Am I as lovely as a cunning charming honest peaceful bold pony?

Beatrice:
If not, let us return to scene II.

Benedick: 
You are as worried as the sum of yourself and a Microsoft.

Beatrice:
Recall your father's disappointment!

		Scene III: Benedick teaches his friends about hacking.

Beatrice:
Recall your latest blunder! You are as smelly as the difference between yourself and Achilles.  

Benedick: 
You are as disgusting as the sum of yourself and a flirt-gill!

[Exit Beatrice]
[Enter Don John]

Benedick:
Thou art as damned as I.

[Exit Don John]
[Enter Don Pedro]

Don Pedro:
You are as rotten as the sum of yourself and myself. 
Thou art as normal as the remainder of the quotient between thyself and Cleopatra.

[Exit Benedick]
[Enter Don John]

Don John:
You are as good as I.

[Exeunt]
[Enter Beatrice and Benedick]

Beatrice:
You are as foul as the sum of yourself and Achilles. Speak your mind!

Benedick:
Are you better than nothing?

Beatrice:
If so, let us return to scene III.

[Exeunt]

Now, upon analyzing the source code above, one of the most notable findings we’ll see is that Beatrice is actually pretty freaking savage.

Anyway, in order to make it easier to understand the logic of the Shakespeare program, we can use this nifty little tool called spl2c which converts Shakespeare programs into C programs. The tool can be found in this StackOverflow thread as pointed out by the “Hints” section of the problem.

Once you get that setup, you can then convert the Shakespeare program into C and you’ll end up with an output similar to the one in the code section below. To make it easier to understand, I added some Pythonic annotations to the code.

/********************************************************************
 *   This C program was generated by spl2c, the Shakespeare to C    *
 *          converter by Jon Åslund and Karl Hasselström.           *
 ********************************************************************/

/* libspl definitions and function prototypes */
#include "spl.h"

int main(void)
{
  /******************************************************************
   * MUCH ADO ABOUT HACKING                                         *
   ******************************************************************/
  
  CHARACTER *benedick;                    /* a budding young hacker */
  CHARACTER *beatrice;                    /* a veteran exploiter */
  CHARACTER *don_pedro;                   /* a good friend of the others */
  CHARACTER *don_john;                    /* he is just kinda there */
  CHARACTER *achilles;                    /* I thought he was from Greece */
  CHARACTER *cleopatra;                   /* now this is just getting ridiculous */
  
  int comp1, comp2;
  
  global_initialize();
  
  benedick = initialize_character("Benedick");
  beatrice = initialize_character("Beatrice");
  don_pedro = initialize_character("Don Pedro");
  don_john = initialize_character("Don John");
  achilles = initialize_character("Achilles");
  cleopatra = initialize_character("Cleopatra");
  
  act_i:                                  /* Also the last act */
  
  act_i_scene_i:                          /* Benedick learns his place */
  
  // don_john.value = 0
  enter_scene(15, beatrice);
  enter_scene(15, don_john);
  
  activate_character(20, beatrice);
  assign(18, second_person, 0);
  
  exit_scene(20, don_john);
  
  // don_pedro.value = 0
  enter_scene(21, don_pedro);
  
  activate_character(26, beatrice);
  assign(24, second_person, 0);
  
  exit_scene(26, don_pedro);
  
  // achilles.value = 32
  enter_scene(27, achilles);
  
  activate_character(32, beatrice);
  assign(30, second_person, 2*2*2*2*2*1);
  
  exit_scene(32, achilles);

  // cleopatra.value = 128 - achilles  
  enter_scene(33, cleopatra);
  
  activate_character(38, beatrice);
  assign(36, second_person, int_sub(36, 2*2*2*2*2*2*2*1, achilles->value));
  
  exit_scene(38, cleopatra);
  
  // benedick.value = 0
  enter_scene(39, benedick);
  
  activate_character(44, beatrice);
  assign(42, second_person, 0);
  
  act_i_scene_ii:                         /* Benedick strengthens his memory */
  
  // benedick.value = getchar()
  // benedick.push(benedick.value)
  activate_character(49, beatrice);
  char_input(47, second_person);
  push(47, second_person, value_of(47, second_person));
  
  // beatrice.value = beatrice.value + 1
  activate_character(53, benedick);
  assign(50, second_person, int_add(50, value_of(50, second_person), 1));

  // flag = benedick.value == 32
  comp1 = value_of(51, first_person);
  comp2 = 2*2*2*2*2*1;
  truth_flag = (comp1 == comp2);
  
  // if not flag: goto act_i_scene_ii
  activate_character(56, beatrice);
  if (!truth_flag) {
    goto act_i_scene_ii;
  }
  
  // beatrice.value = beatrice.value - 1
  activate_character(59, benedick);
  assign(57, second_person, int_add(57, value_of(57, second_person), (-1)));
  
  // benedick.value = benedick.pop()
  activate_character(62, beatrice);
  pop(60, second_person);
  
  act_i_scene_iii:                        /* Benedick teaches his friends about hacking */
  
  // benedick.value = benedick.pop()
  // benedick.value = benedick.value - achilles.value
  activate_character(67, beatrice);
  pop(65, second_person);
  assign(65, second_person, int_sub(65, value_of(65, second_person), achilles->value));
  
  // beatrice.value = beatrice.value - 1
  activate_character(70, benedick);
  assign(68, second_person, int_add(68, value_of(68, second_person), (-1)));
  
  exit_scene(70, beatrice);
  
  enter_scene(71, don_john);
  
  // don_john.value = benedick.value
  activate_character(76, benedick);
  assign(74, second_person, value_of(74, first_person));
  
  exit_scene(76, don_john);
  
  enter_scene(77, don_pedro);
  
  // benedick.value = benedick.value + don_pedro.value
  // benedick.value = benedick.value % cleopatra.value
  activate_character(83, don_pedro);
  assign(80, second_person, int_add(80, value_of(80, second_person), value_of(80, first_person)));
  assign(81, second_person, int_mod(81, value_of(81, second_person), cleopatra->value));
  
  exit_scene(83, benedick);
  
  enter_scene(84, don_john);
  
  // don_pedro.value = don_john.value
  activate_character(89, don_john);
  assign(87, second_person, value_of(87, first_person));
  
  exit_scene_all(89);
  
  enter_scene(90, beatrice);
  enter_scene(90, benedick);
  
  // benedick.value = benedick.value + achilles.value
  // print (char) benedick.value
  activate_character(95, beatrice);
  assign(93, second_person, int_add(93, value_of(93, second_person), achilles->value));
  char_output(93, second_person);
  
  // flag = beatrice.value > 0
  activate_character(98, benedick);
  comp1 = value_of(96, second_person);
  comp2 = 0;
  truth_flag = (comp1 > comp2);
  
  // if flag: goto act_i_scene_iii
  activate_character(101, beatrice);
  if (truth_flag) {
    goto act_i_scene_iii;
  }
  
  exit_scene_all(101);
  
  return 0;
}

After that, I simplified it further by turning the annotations and some parts of the C program into Pythonic pseudo-code:

  benedick = initialize_character("Benedick");
  beatrice = initialize_character("Beatrice");
  don_pedro = initialize_character("Don Pedro");
  don_john = initialize_character("Don John");
  
  act_i:
  
    act_i_scene_i:
    
      // initialize variables to zero
      don_john.value = 0
      don_pedro.value = 0
      benedick.value = 0
    
    act_i_scene_ii:
      
      // get input string and stroe in benedick stack; terminate with space
      // store space character in benedick value
      // store input string length in beatrice value exclusive of space
      benedick.value = read()
      benedick.push(benedick.value)
      beatrice.value++
      if not benedick.value == 32:
        goto act_i_scene_ii
      beatrice.value--
      benedick.value = benedick.pop()
    
    act_i_scene_iii:
    
      // pop last character and decrease string length
      // adjust character
      benedick.value = benedick.pop()
      benedick.value-= 32
      beatrice.value--

      // remember original adjusted value of current character
      don_john.value = benedick.value

      // add current adjusted character to previous adjusted character and modulo by 96
      benedick.value+= don_pedro.value
      benedick.value%= 96

      // remember original value of current adjusted character for next operation
      don_pedro.value = don_john.value

      // unadjust character
      benedick.value+= 32

      // print the value of the current character after manipulation
      print (char) benedick.value

      // if some characters are remaining, handle them in another iteration
      if beatrice.value > 0:
        goto act_i_scene_iii

Alright. So, in the only “act” (essentially, a goto label) in the program, there are three scenes (which are also goto labels).

The first scene simply initializes the values of the variables that exist in the program.

The second scene reads a string from the user and stores it on Benedick’s mind while Beatrice tracks the length of the string from the user.

What do I mean by that, you ask? Well first of all, all variables in the Shakespeare language can only be play characters from Shakespeare’s work. Each character has a value attribute and a stack attribute. You can directly assign values to a character’s value attribute and you can also push values into their stack attribute. If you pop a character’s stack, the top most value in their stack gets removed and then gets assigned to their value attribute. Now, in the language syntax, you tell characters to “recall” something in order to pop their stack, and you tell them to “remember” in order to push into their stack. That’s why I referred to Benedick’s stack as his “mind”. Additionally, in order to ask the user for input, you tell characters to “open their mind”.

The third scene is where it gets tricky. Upon analyzing the Pythonic pseudo-code I wrote above, I believe that it does some sort of chained Caesar cipher on each of the characters in the input string starting from the last character up until the first character. By “chained”, I mean that the offset for shifting each character depends on the difference between the ASCII value of the last character and 32 (space).

Essentially, the formula for getting one character of output is: output[n] = (((input[n]-32) + (n>0)?(input[n-1]-32):0) % 96) + 32.

After it does the cipher for one character, the program checks if there are anymore characters to shift. If yes, it repeats the third scene and pops the next character to encrypt from Benedick’s stack. Else, the program exits.

So, again, our desired result is tu1|\h+&g\OP7@% :BH7M6m3g=. In order to get each of these characters, we just need to substitute them into the formula we defined above.

The following code section displays my solutions for finding each of the input characters required to generate the desired output:

't' ; 116 = (((input[00]-32) + 0) % 96) + 32        ; input[00] = 116 ; input = 't' ;
'u' ; 117 = (((input[01]-32) + (116-32)) % 96) + 32 ; input[01] = 033 ; input = '!' ;
'1' ; 049 = (((input[02]-32) + (033-32)) % 96) + 32 ; input[02] = 048 ; input = '0' ;
'|' ; 124 = (((input[03]-32) + (048-32)) % 96) + 32 ; input[03] = 108 ; input = 'l' ;
'\' ; 092 = (((input[04]-32) + (108-32)) % 96) + 32 ; input[04] = 112 ; input = 'p' ;
'h' ; 104 = (((input[05]-32) + (112-32)) % 96) + 32 ; input[05] = 120 ; input = 'x' ;
'+' ; 043 = (((input[06]-32) + (120-32)) % 96) + 32 ; input[06] = 051 ; input = '3' ;
'&' ; 038 = (((input[07]-32) + (051-32)) % 96) + 32 ; input[07] = 115 ; input = 's' ;
'g' ; 103 = (((input[08]-32) + (115-32)) % 96) + 32 ; input[08] = 116 ; input = 't' ;
'\' ; 092 = (((input[09]-32) + (116-32)) % 96) + 32 ; input[09] = 104 ; input = 'h' ;
'O' ; 079 = (((input[10]-32) + (104-32)) % 96) + 32 ; input[10] = 103 ; input = 'g' ;
'P' ; 080 = (((input[11]-32) + (103-32)) % 96) + 32 ; input[11] = 105 ; input = 'i' ;
'7' ; 055 = (((input[12]-32) + (105-32)) % 96) + 32 ; input[12] = 078 ; input = 'N' ;
'@' ; 064 = (((input[13]-32) + (078-32)) % 96) + 32 ; input[13] = 114 ; input = 'r' ;
'%' ; 037 = (((input[14]-32) + (114-32)) % 96) + 32 ; input[14] = 051 ; input = '3' ;
' ' ; 032 = (((input[15]-32) + (051-32)) % 96) + 32 ; input[15] = 109 ; input = 'm' ;
':' ; 058 = (((input[16]-32) + (109-32)) % 96) + 32 ; input[16] = 077 ; input = 'M' ;
'B' ; 066 = (((input[17]-32) + (077-32)) % 96) + 32 ; input[17] = 117 ; input = 'u' ;
'H' ; 072 = (((input[18]-32) + (117-32)) % 96) + 32 ; input[18] = 083 ; input = 'S' ;
'7' ; 055 = (((input[19]-32) + (083-32)) % 96) + 32 ; input[19] = 100 ; input = 'd' ;
'M' ; 077 = (((input[20]-32) + (100-32)) % 96) + 32 ; input[20] = 105 ; input = 'i' ;
'6' ; 054 = (((input[21]-32) + (105-32)) % 96) + 32 ; input[21] = 077 ; input = 'M' ;
'm' ; 109 = (((input[22]-32) + (077-32)) % 96) + 32 ; input[22] = 064 ; input = '@' ;
'3' ; 051 = (((input[23]-32) + (064-32)) % 96) + 32 ; input[23] = 115 ; input = 's' ;
'g' ; 103 = (((input[24]-32) + (115-32)) % 96) + 32 ; input[24] = 116 ; input = 't' ;
'=' ; 061 = (((input[25]-32) + (116-32)) % 96) + 32 ; input[25] = 073 ; input = 'I' ;

I know that there are more than one possible solutions for each of the input since the modulo operation is involved, but try to select the value of a displayable character (i.e. one between 32 and 127).

Anyway the list of inputs that you will end up with is reversed because it is the last value that is encrypted and printed onto the screen first given that the input characters were pushed and popped from a stack which we all know operates on a first-in-last-out basis.

In order to get the flag, you just need to reverse the input characters.

The flag is Its@MidSuMm3rNights3xpl0!t.


 

This concludes my write-up for the picoCTF 2017 event for now. I might solve some of the other challenges in there since it still seems to be open despite the contest having closed last April 14, so stay tuned!

Password-based Database Encryption

I’ve recently started doing some web application development again after stumbling about in the world of Information Security for the past two years of my career, and I’ve come to notice the many flaws in the way that I have always written code. One particular area of weakness in my ever-evolving application development methodology involves my use, or rather, non-use of encryption when storing sensitive user information in the database. Don’t worry though. That was back in college and I assure you that you won’t find any of the insecure systems that I have written deployed in the wild.

Table of Contents

The Common Approach

[ Back to Table of Contents ]

pwd-db-crypt-fig-1
Use of Single Key to Encrypt Database Data

After researching about this topic for days on end, I’ve come to notice that what seems to be the “normal” practice is to store a clear-text application-wide encryption key in a configuration file somewhere in the web application server and then use that key to encrypt data using some symmetric encryption algorithm (usually AES-256-CBC) before pushing the data onto the database. And then when you have to retrieve the encrypted data, you just pull it off the database and run it through the decryption routine of your chosen algorithm using the application-wide encryption key that you have stored on the web server.

Problem: Single Point of Failure

[ Back to Table of Contents ]

The approach above may be adequate for most cases, but I wanted to take it a bit further. I wanted the appropriate users to be the only ones to have the ability to decrypt the data that they should be able to view. If possible, I wanted the attacker to not have the ability to decrypt sensitive data from the database even after compromising both the web server and the database.

As you may have noticed, the approach described in the section above suffers from a single point of failure issue because there’s only one key for all the encrypted data in the entire application and having it compromised may allow some hack to decrypt our precious data in case they get a hold of our database (so make that two points of failures, but you get my drift). However, that should be unlikely because you, dear developer, are guarding our database against injection attacks by using prepared statements or some sort of escaping routine right? And you did apply appropriate access control to our database by using a different user account per type of transaction that our application can perform right? And you did apply appropriate table/function/view/procedure access permissions to those user accounts right? Good! But you can never be too careful.

Solution: Encrypt Sensitive Data Using User Passwords

[ Back to Table of Contents ]

pwd-db-crypt-fig-2
Use of User Password-Derived Encryption Keys to Encrypt User Data

To eliminate that single point of failure. Why don’t we delegate the management of encryption keys to our users? I mean, aren’t they already holding onto something that’s a key of sorts?

If I wasn’t clear enough, I’m talking about their passwords.

Think about it. If we run our user’s passwords through some Key Derivation Function (KDF) and use the result to encrypt the data that only they should be able to access, then we won’t have to store any clear-text encryption keys anywhere. The encryption keys would be stored in the brains of our users and any key leakages will just expose the stuff of those users involved in the leakage.

Anyway, in this scheme, we just ask the user to enter their password whenever they need to perform some operation involving their encrypted data, run their password through our chosen KDF, decrypt their data using the key generated by the KDF, perform the operation using the clear-text form of their data, and re-encrypt again when we need to store it somewhere else.

Nice and clean!

Do note that we would have to avoid storing the key generated by the KDF into the database because we’d just be providing the encryption key to the adversary in the case that they be able to get a hold of the database. I’m pointing this out because PBKDF2 is a common way to derive encryption keys from a password while also being a common way to store password hashes in the database. If you’re using PBKDF2 to store user password hashes AND derive encryption keys from user passwords, you should probably use a different salt for the password hash of your user and a different salt for their encryption key.

So that would be:

encryption_key_salt = generate_salt()
encryption_key = pbkdf2(encryption_key_salt, password)
password_hash_salt = generate_salt()
password_hash = pbkdf2(password_hash_salt, password)

Again, do not store the encryption_key into the database as you will be using that to encrypt and decrypt user data. Instead, you will just have to compute it every time you need to access or write encrypted user data from and to the database. This would however require that the user enter their password for every transaction involving their encrypted data.

You should only store the encryption_key_iv, password_hash_iv, and password_hash into the database. If you’re asking why we’re storing the password hash but not the encryption key, it’s because we use the hash to determine whether or not the user has entered the correct password. Reversing a hash is computationally infeasible so it should be safe to store it in the database.

Anyway, according to what I’ve read, you’d generally want to store passwords as BCrypt instead of PBKDF2, but using the results of BCrypt as an encryption key isn’t such a good idea because it produces a fixed-length result and you wouldn’t be able to adjust it to encryption algorithms that would require a length of more or less than that of the results produced by BCrypt.

By this point, our database should look something a bit like this:

pwd-db-crypt-fig-8
Schema for Password-based Encryption

Table: User

  • id
    • Primary key identifying a user
  • username_sha256
    • Unsalted SHA256 hash of the user’s username
  • password_salt
    • Salt used to compute the user’s BCrypt password hash
  • password_bcrypt
    • The BCrypt password hash of the user
    • Used for authenticating the user
  • pbkdf2_key_salt
    • Salt used for computing the user’s PBKDF2 password-based encryption key

Table: DataEntity

  • id
    • Primary key identifying a data entity
  • user_id
    • The identifier of the user owning the data within the data entity
  • value_iv
    • The Initialization Vector for encrypting and decrypting the value of the data within the data entity
  • value_aes256cbc
    • The encrypted value of the data within the data entity
    • Encrypted using AES-256-CBC and uses the user’s password-based encryption key for encryption

Problem: Changing Passwords

[ Back to Table of Contents ]

But what if the user wants to change their password? Wouldn’t that prevent us from accessing the data already encrypted with their passwords unless we re-encrypt each of them one-by-one. Do we really have to re-encrypt all their data? That would be terribly inefficient once the amount of data scales up!

Yes. That would be a very troublesome consequence indeed if we do decide to use user passwords to encrypt our users’ data. Fortunately, there is a work around to this and it involves encrypting our users’ data with an intermediate key rather than using a derivation of the user password itself.

Solution: Use An Intermediate Encryption Key

[ Back to Table of Contents ]

pwd-db-crypt-fig-3
Using Intermediate Encryption Keys to Encrypt or Decrypt Data

Basically, what we would do is: we randomly generate an encryption key for each of our users and then use that encryption key to encrypt their data. We then encrypt each user’s encryption key with a derivation of their password (using a KDF) and then store the encrypted user encryption key in the appropriate row in the user table. Whenever we need to encrypt or decrypt some data for a specific user, we just ask them for their password, run it through the same Key Derivation Function we used earlier, use the result to decrypt the user’s encrypted encryption key, and then use the now-decrypted encryption key to encrypt or decrypt user data.

So that would be:

encryption_key_pbkdf_salt = generate_salt()
encryption_key_aes_iv = generate_iv()
encryption_key = aes256(
    encryption_key_aes_iv,
    pbkdf2(encryption_key_pbkdf_salt, password),
    generate_encryption_key()
)

Instead of:

encryption_key_salt = generate_iv()
encryption_key = pbkdf2(encryption_key_salt, password)

Using this approach, we would only need to re-encrypt the user’s intermediate encryption key whenever the user decides to change their password instead of having to re-encrypt all data involving a specific user, thus, saving us a valuable amount of computing resources.

Do note that we would have no choice but to store the encrypted encryption key in the database unlike the previous approach, but this would eliminate the difficulty with changing user passwords.

By this point, our database should look something a bit like this:

pwd-db-crypt-fig-9
Schema when Intermediate Keys are used

Table: User

  • id
    • Primary key identifying a user
  • username_sha256
    • Unsalted SHA256 hash of the user’s username
  • password_salt
    • Salt used to compute the user’s BCrypt password hash
  • password_bcrypt
    • The BCrypt password hash of the user
    • Used for authenticating the user
  • pbkdf2_key_salt
    • Salt used for computing the user’s PBKDF2 password-based encryption key
  • user_key_iv
    • Initialization Vector for encrypting and decrypting the user’s intermediate key
  • user_key_aes256cbc
    • The encrypted value of the user’s intermediate key
    • Encrypted using AES-256-CBC and uses the user’s password-based encryption key for encryption

Table: DataEntity

  • id
    • Primary key identifying a data entity
  • user_id
    • The identifier of the user owning the data within the data entity
  • value_iv
    • The Initialization Vector for encrypting and decrypting the value of the data within the data entity
  • value_aes256cbc
    • The encrypted value of the data within the data entity
    • Encrypted using AES-256-CBC and uses the user’s intermediate encryption key for encryption

Problem: Resetting Passwords

[ Back to Table of Contents ]

So we now have the capability to allow users to change their passwords. But what if the user forgets their password and they now have to reset it? We’re in a bit of a bind here since we would need the derived encryption key from the user’s password in order to re-encrypt their encryption key. We don’t exactly have the user’s password stored in a manner where we would be able to derive the original password or its derivation so we’re pretty much stuck.

Solution: Recovery Codes and Security Questions

[ Back to Table of Contents ]

pwd-db-crypt-fig-4
Using Recovery Codes to Create Recoverable Copies of the Intermediate Key

The next best thing would be to have other encrypted copies of the user’s encryption key using alternative “passwords” in the form of randomly generated recovery codes or answers to security questions.

When using randomly generated recovery codes, you’d want to provide it to the user upon registration and then provide them the ability to randomly generate a new set on demand. You are going to need access to the user’s encryption key when generating recovery codes and so you are going to have to ask the user for their password when generating a new set. What you’d basically do is: randomly generate a recovery code, hash it with BCrypt for later authentication,  encrypt the user’s encryption key using a derivation of the recovery code that was just generated, and then store the hash of the recovery code and encrypted copy of the user’s encryption key in a recovery codes table in your database.

In a nutshell, that would be:

recovery_code = generate_recovery_code()
recovery_code_salt = generate_salt()
recovery_code_hash = bcrypt(recovery_code_salt, recovery_code)
encryption_key_pbkdf_salt = generate_salt()
encryption_key_aes_iv = generate_iv()
encryption_key = aes256(
    encryption_key_aes_iv,
    pbkdf2(encryption_key_pbkdf_salt, recovery_code),
    encryption_key_clear
)

As stated above, recovery codes are a form of alternative “passwords” and so we would have to follow the rules of password storage when storing recovery codes, and those are: 1) never store them in clear text, and 2) store them as a hash (preferably using BCrypt) so you can authenticate later. You are going to have to store the encrypted encryption key though because it needs to be recovered and re-encrypted with the user’s new password later on when the user resets his or her password.

Upon provision of the recovery codes, the user should be instructed to store them in a safe place such as a password manager or an actual physical safe. When users forget their password, they would need to provide one of the recovery codes in order to decrypt their encrypted encryption key and then re-encrypt it with the derivation of the new password of their choosing. You should also invalidate the old recovery codes and generate a new set when the user resets his or her password.

When using security questions, you can either ask users for common personal questions or give them the ability to provide their own set of security questions. In the case of common personal questions, you’d want to use multiple ones and then have combinations of answers serve as the user’s recovery codes. In the case of custom user-provided questions, you might be able to use the answers as recovery codes individually given that you require the users to provide a minimum length for each of their answers.

As for storing answers to security questions, you’d pretty much handle them in the same manner as you would recovery codes, so you may refer to the instructions above when storing the answers to security questions.

By this point, our database should look something a bit like this:

pwd-db-crypt-fig-10.png
Schema when Recovery Codes are used

Table: User

  • id
    • Primary key identifying a user
  • username_sha256
    • Unsalted SHA256 hash of the user’s username
  • password_salt
    • Salt used to compute the user’s BCrypt password hash
  • password_bcrypt
    • The BCrypt password hash of the user
    • Used for authenticating the user
  • pbkdf2_key_salt
    • Salt used for computing the user’s PBKDF2 password-based encryption key
  • user_key_iv
    • Initialization Vector for encrypting and decrypting the user’s intermediate key
  • user_key_aes256cbc
    • The encrypted value of the user’s intermediate key
    • Encrypted using AES-256-CBC and uses the user’s password-based encryption key for encryption

Table: DataEntity

  • id
    • Primary key identifying a data entity
  • user_id
    • The identifier of the user owning the data within the data entity
  • value_iv
    • The Initialization Vector for encrypting and decrypting the value of the data within the data entity
  • value_aes256cbc
    • The encrypted value of the data within the data entity
    • Encrypted using AES-256-CBC and uses the user’s intermediate encryption key for encryption

Table: RecoveryCode

  • id
    • Primary key identifying a recovery code
  • user_id
    • The identifier of the user owning the recovery code and user key contained within the recovery code entity
  • recovery_code_salt
    • Salt used to compute the recovery code’s BCrypt hash
  • recovery_code_bcrypt
    • The BCrypt hash of the recovery code
    • Used for authenticating the recovery code
  • pbkdf2_keysalt
    • Salt used for computing a recovery code-based encryption key using PBKDF2
  • user_key_iv
    • Initialization Vector for encrypting and decrypting the user’s intermediate key
  • user_key_aes256cbc
    • Encrypted value of the user’s intermediate key
    • Encrypted using AES-256-CBC and uses the recovery code-based encryption key for encryption

Problem: Sharing Data With Other Users

[ Back to Table of Contents ]

But wait. What if a piece of encrypted data created by one user needs to be shared with another? Surely the method of having an encryption key encrypt sensitive data per user won’t work since in order to decrypt a piece of encrypted data, the data owner will have to enter his or her password. Other users do not have access to the data owner’s password so they won’t be able to read the data. An approach one might try is to obtain an unencrypted copy of the owner’s data when the owner invokes the action of sharing his or her data to another party (where we would have to ask the data owner for their password), but then we won’t be able to encrypt the data upon receipt of the second party because then we would need access to their password. So using this, scheme, both parties will have to do the process of sharing in real time. This would greatly reduce the usability of your system. However, there is hope, and it’s called Asymmetric Encryption.

Solution: Use Asymmetric Encryption Scheme

[ Back to Table of Contents ]

pwd-db-crypt-fig-5
Encrypting Shared Data with Recipient’s Public Key and Decrypting it with their Private Key

Asymmetric Encryption in a nutshell is where you have a Public Key and a Private Key. Things you encrypt with your Public Key can only be decrypted by your Private Key, and things you encrypt with your Private Key can only be decrypted by your Public Key. In an Asymmetric Encryption scheme, you share your Public Key with everyone while you keep the Private Key to yourself. Everyone then uses your Public Key to encrypt data they want to share with you, and then you use your Private Key to decrypt it.

If we want to apply this scheme to our system, we would have to assign a Public Key and a Private Key to each user. We then encrypt everyone’s Private Key using the intermediate symmetric encryption keys assigned to them as described in the earlier sections. You may also encrypt user Private Keys using the derivation of your users’ passwords directly but you will have to re-encrypt two keys when the users change or reset their passwords. You may alternatively forgo the use of the symmetric intermediate encryption key and just encrypt Private Keys with the derivations of your users’ passwords directly though I would suggest having both in case we would need to store data that only the data owner should be able to access.

So in a nutshell, what’s going to happen is:

public_key, private_key = generate_key_pair()
private_key_aes_iv = generate_iv()
private_key_encrypted = aes256(
    private_key_aes_iv,
    encryption_key,
    private_key
)

If a user then wants to share their data with another user, we won’t need to have access to the recipient’s password in order for the owner to share their data with the recipient. We just obtain the recipient’s Public Key (which we store in clear-text), encrypt the data that we want to share with them using said Public Key, and then give them access to the data encrypted with their Public Key which they will then decrypt with their Private Key.

By this point, our database should look something a bit like this:

pwd-db-crypt-fig-11.png
Schema when Data Sharing via Asymmetric Encryption is used

Table: User

  • id
    • Primary key identifying a user
  • username_sha256
    • Unsalted SHA256 hash of the user’s username
  • password_salt
    • Salt used to compute the user’s BCrypt password hash
  • password_bcrypt
    • The BCrypt password hash of the user
    • Used for authenticating the user
  • pbkdf2_key_salt
    • Salt used for computing the user’s PBKDF2 password-based encryption key
  • user_key_iv
    • Initialization Vector for encrypting and decrypting the user’s intermediate key
  • user_key_aes256cbc
    • The encrypted value of the user’s intermediate key
    • Encrypted using AES-256-CBC and uses the user’s password-based encryption key for encryption
  • public_key
    • The unencrypted public key of the user
  • private_key_iv
    • The Initialization Vector for encrypting and decrypting the user’s private key
  • private_key_aes256cbc
    • The encrypted value of the user’s private key
    • Encrypted using AES-256-CBC and uses the user’s intermediate key for encryption

Table: DataEntity

  • id
    • Primary key identifying a data entity
  • user_id
    • The identifier of the user owning the data within the data entity
  • value_iv
    • The Initialization Vector for encrypting and decrypting the value of the data within the data entity
  • value_aes256cbc
    • The encrypted value of the data within the data entity
    • Encrypted using AES-256-CBC and uses the user’s intermediate encryption key for encryption

Table: RecoveryCode

  • id
    • Primary key identifying a recovery code
  • user_id
    • The identifier of the user owning the recovery code and user key contained within the recovery code entity
  • recovery_code_salt
    • Salt used to compute the recovery code’s BCrypt hash
  • recovery_code_bcrypt
    • The BCrypt hash of the recovery code
    • Used for authenticating the recovery code
  • pbkdf2_keysalt
    • Salt used for computing a recovery code-based encryption key using PBKDF2
  • user_key_iv
    • Initialization Vector for encrypting and decrypting the user’s intermediate key
  • user_key_aes256cbc
    • Encrypted value of the user’s intermediate key
    • Encrypted using AES-256-CBC and uses the recovery code-based encryption key for encryption

Table: SharedDataEntity

  • id
    • Primary key identifying an instance of shared data
  • data_id
    • The identifier of the data entity whose value is being shared
  • recipient_id
    • The identifier of the recipient user
  • value_rsa
    • Encrypted value of the data being shared to the recipient user
    • Encrypted using RSA and uses the Public Key of the receiving user for encryption
    • Decrypted using RSA and uses the Private key of the receiving user for decryption

Problem: Multiple Invocations of Asymmetric Algorithms

[ Back to Table of Contents ]

This isn’t really a problem to be honest but it would be best if we could limit the number of times that encryption and decryption takes place using an Asymmetric algorithm because it’s known to be very very slow. We can get around this limitation by allowing two communicating users to exchange a shared encryption key with each other and then use that encryption key to encrypt and decrypt the data that should be shared between the two users.

Solution: Perform Encryption Using Shared Key

[ Back to Table of Contents ]

pwd-db-crypt-fig-6
Using a Shared Key Shared Via Asymmetric Encryption to Encrypt and Decrypt Shared Data

So basically, what would happen is: we randomly generate an encryption key that will be shared between two users who want to communicate, create two copies of it, encrypt one copy with the Public Key of one user and encrypt the other with the Public Key of the other user, and then give each user the copy which was encrypted with their Public Key. And then, to acquire the shared key, each user would then decrypt the copy of the shared key which was encrypted with their Public Key using their Private Key and then re-encrypt it using their intermediate encryption key so they wouldn’t need to decrypt the shared encryption key over and over again using their Private Key thus avoiding the repetitive use of slow Asymmetric Encryption and Decryption.

So that would be:

shared_key = generate_encryption_key()
shared_key_rsa_user0 = rsa_encrypt(public_key_user0, shared_key)
shared_key_rsa_user1 = rsa_encrypt(public_key_user1, shared_key)

Again, don’t store the shared_key in the database. Store only the copies of the shared key that was encrypted by the Public Keys of the users.

And then on the side of each user:

shared_key = rsa_decrypt(private_key, shared_key_rsa)
shared_key_aes_iv = generate_iv()
shared_key_aes = aes256(
    shared_key_aes_iv,
    encryption_key,
    shared_key
)

The interesting thing about this approach is that you could scale it up and have a shared key between not just two users, but an entire group of users. Anyone with access to the shared key can potentially add new users to the group, but you would most likely want to limit this capability to certain users by enforcing it through your application’s code.

By this point, our database should look something a bit like this:

pwd-db-crypt-fig-12.png
Schema when Shared Keys are used

Table: User

  • id
    • Primary key identifying a user
  • username_sha256
    • Unsalted SHA256 hash of the user’s username
  • password_salt
    • Salt used to compute the user’s BCrypt password hash
  • password_bcrypt
    • The BCrypt password hash of the user
    • Used for authenticating the user
  • pbkdf2_key_salt
    • Salt used for computing the user’s PBKDF2 password-based encryption key
  • user_key_iv
    • Initialization Vector for encrypting and decrypting the user’s intermediate key
  • user_key_aes256cbc
    • The encrypted value of the user’s intermediate key
    • Encrypted using AES-256-CBC and uses the user’s password-based encryption key for encryption
  • public_key
    • The unencrypted public key of the user
  • private_key_iv
    • The Initialization Vector for encrypting and decrypting the user’s private key
  • private_key_aes256cbc
    • The encrypted value of the user’s private key
    • Encrypted using AES-256-CBC and uses the user’s intermediate key for encryption

Table: DataEntity

  • id
    • Primary key identifying a data entity
  • user_id
    • The identifier of the user owning the data within the data entity
  • value_iv
    • The Initialization Vector for encrypting and decrypting the value of the data within the data entity
  • value_aes256cbc
    • The encrypted value of the data within the data entity
    • Encrypted using AES-256-CBC and uses the user’s intermediate encryption key for encryption

Table: RecoveryCode

  • id
    • Primary key identifying a recovery code
  • user_id
    • The identifier of the user owning the recovery code and user key contained within the recovery code entity
  • recovery_code_salt
    • Salt used to compute the recovery code’s BCrypt hash
  • recovery_code_bcrypt
    • The BCrypt hash of the recovery code
    • Used for authenticating the recovery code
  • pbkdf2_keysalt
    • Salt used for computing a recovery code-based encryption key using PBKDF2
  • user_key_iv
    • Initialization Vector for encrypting and decrypting the user’s intermediate key
  • user_key_aes256cbc
    • Encrypted value of the user’s intermediate key
    • Encrypted using AES-256-CBC and uses the recovery code-based encryption key for encryption

Table: SharedKey

  • id
    • Primary key identifying a shared key
  • shared_key_salt
    • Salt used to compute the BCrypt hash of the shared key
  • shared_key_bcrypt
    • The BCrypt hash of the shared key
    • Used for authenticating decrypted versions of associated copies of the shared key

Table: SharedKeyCopy

  • id
    • Primary key identifying a copy of a shared key
  • key_id
    • The identifier of the shared key that an instance derives from
  • owner_id
    • The identifier of the user owning the copy of the shared key
  • shared_key_rsa
    • Encrypted value of the copy of the shared key
    • Encrypted using RSA and uses the Public Key of the owning user for encryption
    • Decrypted using RSA and uses the Private key of the owning user for decryption
  • shared_key_iv
    • Initialization Vector for encrypting and decrypting the copy of the shared key which was encrypted using AES-256-CBC
    • Initially empty at first, but is filled up once the user first accesses the clear-text value of the shared key
  • shared_key_aes256cbc
    • Encrypted value of the copy of the shared key
    • Encrypted using AES-256-CBC and uses the user’s intermediate key for encryption
    • Initially empty at first, but is filled up once the user first accesses the clear-text value of the shared key
    • The application should check whether or not this value exists first before trying to decrypt the copy of the shared key encrypted using RSA

Table: SharedDataEntity

  • id
    • Primary key identifying a shared data entity
  • key_id
    • The identifier of the shared key that was used to encrypt the value of the data contained within the entity
  • value_iv
    • Initialization Vector for encrypting and decrypting the value of the data contained wihtin the shared data entity
  • value_aes256cbc
    • The encrypted value of the data within the data entity
    • Encrypted using AES-256-CBC and uses the shared encryption key for encryption

Problem: User Password Compromise

[ Back to Table of Contents ]

Of course, there’s always the chance that the user’s password can be compromised by an attacker. If an attacker gets access to the user’s password and also gets a hold of the database, they would be able to decrypt the user’s data. However, they could also just walk right in through the front door and login using the user’s account in the application and then access the user’s data from there, but I believe that it would be best practice to ensure that decryption of user data can only be done with the consent of both the application and the user, so I would suggest that you optionally add another layer of encryption to your users’ data by additionally encrypting them using an application-wide encryption key.

Solution: Encrypt With Application-Wide Key

[ Back to Table of Contents ]

pwd-db-crypt-fig-1
Using a Layer of Application-wide Encryption

You don’t really have to do this, but if you want to ensure that your data can only be read by the appropriate user using the appropriate application, you can add another layer of encryption on top of the encryption you’re already doing with user-specific encryption keys.

Honestly, we’re really just back to square one here. But instead of solely relying on the common approach, we’d be applying it on top of the approach we’re already doing with user-specific keys. So, basically, as described earlier, the common approach is having a symmetric encryption key stored on a configuration file on the application server, and then using that to encrypt every piece of sensitive data that we store in the database. This way, the adversary would have to compromise both the application server and the user’s password in order to be able to read data directly from the database. It won’t stop them from just waltzing into the application and logging in as the user though, as stated earlier on.

What Data to Encrypt and How to Encrypt Them

[ Back to Table of Contents ]

Here’s a short taxonomy of what kind of data you should encrypt, how to encrypt them, and which keys you should use when encrypting.

  • Secrets
    • Password
      • Algorithm: BCrypt
      • Key: N/A
    • Intermediate Key
      • Algorithm: AES-256-CBC
      • Key: PBKDF2 of Password
    • Private Key
      • Algorithm: AES-256-CBC
      • Key: Intermediate Key
    • Shared Key
      • Non-cached Copy
        • Algorithm: RSA
        • Key: Public Key
      • Cached Copy
        • Algorithm: AES-256-CBC
        • Key: Intermediate Key
  • Sensitive Data
    • User-only
      • Algorithm: AES-256-CBC
      • Key: Intermediate Key
    • Shared with Specific User
      • Non-cached Copy
        • Algorithm: RSA
        • Key: Recipient Public Key
      • Cached Copy
        • Algorithm: AES-256-CBC
        • Key: Shared Key (between owner and recipient)
    • Shared with a Group
      • Algorithm: AES-256-CBC
      • Key: Shared Key (between members of the group

One final note: don’t go nuts and just encrypt everything. You’d want to store as much data in unencrypted form as your security policy will allow because 1) encryption consumes computing resources and 2) you can’t perform queries based on the contents of an encrypted column.

Hack the Vote CTF 2016 Write Up

Okay. Super late write up right here.

Anyway, as you, dear reader, may not be aware of, me and some friends participated in the Hack the Vote CTF 2016 this last November. I wasn’t able to work on the puzzles as much as I wanted to because of work obligations, but I was able to solve a really interesting reverse engineering problem that I think you might find interesting.

Reversing 100 (Consul)

ctf-hackthevote-2016-rev-100
Behold! The reverse engineering 100 challenge!

As a greenhorn in the magnificent field of reverse engineering, this was the only one that I was able to manage in the short amount of time that the challenge was open, but it’s better than nothing I guess, and I believe that other greenhorns such as myself would learn very much from this very simple example of reverse engineering. And so, off we go!

Requirements

Process

  1. Download the ELF binary by clicking the “consul” link the description of the challenge and try running it. We want to be able to observe what the binary does normally before we break it down into tiny little pieces.

    ctf-hackthevote-2016-rev-100-fig-1
    Sad
  2. Nothing useful from there. Go ahead and decompile the ELF binary with the REC Decompiler, and then locate the main function.
    ctf-hackthevote-2016-rev-100-fig-2
  3. Alright. Now let’s look at the other interesting functions above the main function.
    ctf-hackthevote-2016-rev-100-fig-3ctf-hackthevote-2016-rev-100-fig-4ctf-hackthevote-2016-rev-100-fig-5ctf-hackthevote-2016-rev-100-fig-6
  4. These, aside from the main function, are the only ones whose names are not gibberish, so it may be possible that the solution to the problem might indeed be in one of these functions, but if you look at the main function again, it doesn’t seem to access any of the functions above.
    ctf-hackthevote-2016-rev-100-fig-2
  5. Luckily for us though, we can arbitrarily call functions from ELF binaries by using the wonderful tool called “gdb” – also known as the GNU Debugger. So just go pass the ELF as a parameter to the tool in question, and we’ll be doing the magic along the way.
  6. So, first, we’d want to be able to get a hold of the application in it’s initial state, so we are going to have to stop at the main function once the application execution gets there. To do this, just do a break main after entering gdb in order to schedule the suspension of the application’s execution immediately after entering the main function, and then do a run to start the program’s execution.
    ctf-hackthevote-2016-rev-100-fig-7
  7. Now, let’s try calling some functions. You can arbitrarily call functions in the ELF binary using a really neat command called call.
    ctf-hackthevote-2016-rev-100-fig-8
  8. Only real_help seems to work. I think it was trying to give people a clue. Something about the Fibonacci sequence, but I never really understood the purpose, and I was able to finish the challenge even while ignoring it.
  9. Let’s just take a closer look at the other functions and see where they fail.
  10. Let’s do the dont_call_me function first. Since we already called the other functions, let’s start gdb again and call the dont_call_me function so we’d avoid any of the possible unintended modifications that the other functions may have made to the environment when we tested them previously.
    ctf-hackthevote-2016-rev-100-fig-9
  11. The dont_call_me function failed as expected. By observing the back trace using the bt command, we can see that the program stopped inside a function that is inside the sub_41F2 function. Now, the sub_41F2 function only contains a handful of calls to other functions, so it would only be one of the following.
    ctf-hackthevote-2016-rev-100-fig-10
  12. Let’s see if it’s the strlen function. To check, simply add a break to strlen and then call the sub_41F2 function.
    ctf-hackthevote-2016-rev-100-fig-11ctf-hackthevote-2016-rev-100-fig-12
  13. As you see, the program stopped at the start of strlen. We then entered the finish command so that the program would execute until the strlen function finished. There doesn’t seem to have been any problems with the strlen function because the segmentation fault occurred AFTER it finished (as indicated by the segmentation fault message). That means, the problem occurs either on the m0 function, or the other strlen function.
  14. Let’s check the m0 function next. As you can see in the code below, m0 is actually a variable meant to contain a pointer to a function which is supposed to be called on the *m0(); line below.
    ctf-hackthevote-2016-rev-100-fig-10
  15. However, upon inspecting the contents of the m0 variable using the p command, it seems that it contains a reference to NULL so naturally the program WILL throw a segmentation fault for attempting to call NULL as a function.
    ctf-hackthevote-2016-rev-100-fig-13
  16. A quick search in the source code reveals that the m0 variable is initialised inside the c55 function.
    ctf-hackthevote-2016-rev-100-fig-27
  17. However, the c55 function is never actually called anywhere and directly calling the c55 function causes and infinite loop.
    ctf-hackthevote-2016-rev-100-fig-14
  18. Let’s restart gdb to get rid of the breakpoints and then let’s try setting the m0 variable directly.
    ctf-hackthevote-2016-rev-100-fig-15
  19. Now let’s try calling the dont_call_me function again.
    ctf-hackthevote-2016-rev-100-fig-16
  20. Looks like it’s still broken. Buuuuuuut… look carefully at the decompiled code. There’s another malloc function there residing on a different address. This might be the malloc function that’s being set to the m0 variable.
    ctf-hackthevote-2016-rev-100-fig-17
  21. Now let’s set m0 to that address. Good thing the decompiler outputs the addresses of decompiled functions (as indicated in the comment below the declaration of the malloc function).
    ctf-hackthevote-2016-rev-100-fig-18
  22. Now let’s call dont_call_me function again.ctf-hackthevote-2016-rev-100-fig-19
  23. >mfw

    ctf-hackthevote-2016-rev-100-fig-20
    MOM, GET THE CAMERA! #MLG #360NOSCOPE #GITREKT
  24. It works. But there doesn’t seem to be anything useful in the dont_call_me function. Let’s try the other functions.
    ctf-hackthevote-2016-rev-100-fig-21
    Illuminati Confirmed? [Insert X-Files Theme Music]
  25. It seems the other functions started working once we set the m0 variable to the right value. Now search for m0 in the source code and you will find that the functions that reference m0 are sub_41F2, sub_9F36, and sub_198A. The functions that, in turn, reference these functions are dont_call_me, help, c8, and fake_help. We haven’t tried c8 yet, so we just might find the answer there.

    ctf-hackthevote-2016-rev-100-fig-22
    What’s that? “Jet fuel can’t melt steel beams”?
  26. Gibberish. But to be sure, let’s do it again.

    ctf-hackthevote-2016-rev-100-fig-23
    “Harambe was an inside job”?
  27. Different gibberish. I wonder what will happen if we keep doing it over and over again.

    ctf-hackthevote-2016-rev-100-fig-24
    “Lizard people are running the White House”!?
  28. Suddenly… [insert x-files theme music]

    ctf-hackthevote-2016-rev-100-fig-25
    The plot thickens…
  29. That’s not the flag though. We need to go deeper.

    ctf-hackthevote-2016-rev-100-fig-26
    Voila!
  30. And finally, we have our flag, which is flag{write_in_bernie!}.

And that’s it for now folks. I hope you enjoyed this CTF write up despite it being a month late. I surely did enjoy writing it. Anyway, if you need the binary and the decompiled source code, I have attached them onto the section below.

Attachments

Binary
Decompiled Source Code