Translating plain language to filtering queries
Introduction
Building rules with Twitter premium operators to filter the Tweet firehose is key to ingesting just the right data for your platform. Whether your app can generate meaningful analysis or insights depends on receiving the exact data it needs, and your filtering rules are the linchpin in that equation. Twitter premium operators provide extensive choices for defining the content in your stream: options include simple keyword matching and filtering on user attributes, user location, language, and many more, all of which can be combined using boolean logic.
When defining rules that deliver the data you want, one of the biggest challenges is to take requirements as they’re expressed in plain language, and translate them precisely to the operators and filtering syntax. In this article, we will take rules that are articulated in English, and transform them into filtering rules using the appropriate premium operators and syntax.
Example 1
“I want Tweets discussing the Tesla Model 3.”
The easiest place to start is to use the simple keyword ‘tesla’. However, that is only the starting point – it will certainly get us all Tweets that mention ‘tesla,’ but it’s not really targeted. It would also match Tweets where people are discussing Tesla coils, or Nikola Tesla and his rivalry with Thomas Edison, rather than the car. There are two primary ways to target the conversation more precisely:
- Add positive keyword requirements.
- Exclude specific keywords.
Adding positive keywords generally provides a more significant improvement in accuracy, so we’ll begin there. Here, this is fairly simple: we want to restrict our ‘tesla’ matches to those where the exact phrase ‘model 3’ also appears. The filtering rule that represents this looks like the following:
tesla "model 3"
The double-quotes around the phrase ‘model 3’ turn this into an exact phrase match – Tweets will only match where that exact phrase appears in that order within the text.
However, maybe you’ve observed that users also refer to the car as "model-3" or "model three", rather than ‘model 3’. If you want to account for this scenario as well, you could use the following rule:
tesla ("model 3" OR "model-3" OR "model three")
Here, we’ve employed grouping and boolean OR logic. To translate the rule above, Tweets would match where they contain "tesla", and where they contain either "model 3" or "model-3" or "model three".
This concept can be expanded on further. Imagine you only want those Tweets that also use emotionally charged keywords such as "angry", "happy", or "love". These can be included as an additional group, using “and any of these” logic.
tesla ("model 3" OR "model-3" OR "model three") (angry OR happy OR love)
We now have a rule that is significantly more targeted than the original tesla rule. However, you may want greater confidence that you will exclude specific types of references, if they somehow manage to meet the requirements you’ve already specified.
To do this, you simply exclude the types of mentions you’ve identified as undesirable by placing a - character in front of them.
tesla ("model 3" OR "model-3" OR "model three") (angry OR happy OR love) -nikola -edison -coil -coils
Please note: It is a best practice not to group negated operators together using OR logic and parentheses. Instead, negate each term individually. For example, do not combine the negations above into the following:
-(nikola OR edison OR coil OR coils)
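If a negated group like this does turn up in a draft rule, it can be rewritten mechanically, because "none of A or B" means the same thing as "not A and not B". The following is a minimal Python sketch of that rewrite. The helper name expand_negated_group is hypothetical, and the function assumes a flat group whose terms are separated by the OR keyword with no nested parentheses.

```python
import re

def expand_negated_group(clause: str) -> str:
    """Rewrite '-(a OR b OR "c d")' as '-a -b -"c d"'.

    Assumes a flat group: no nested parentheses, and no quoted phrase
    that itself contains ' OR '.
    """
    match = re.fullmatch(r"-\((.*)\)", clause.strip())
    if match is None:
        raise ValueError(f"not a negated OR group: {clause!r}")
    terms = [term.strip() for term in match.group(1).split(" OR ")]
    # A space between terms is an implicit AND in the rule syntax.
    return " ".join(f"-{term}" for term in terms if term)

print(expand_negated_group("-(nikola OR edison OR coil OR coils)"))
# -nikola -edison -coil -coils
```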
We now have a highly targeted rule that will deliver just the Tweets with the types of text phrases we want. To recap, here is the plain-language statement of what our rule will deliver. It matches Tweets which:
- Include "tesla" in the text.
- Also include either "model 3" or "model-3" or "model three" in the text.
- Also include "angry" or "happy" or "love" in the text.
- DO NOT include "nikola", "edison", "coil", or "coils" in the text.
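The recap above maps directly onto three kinds of input: required terms, "any of these" groups, and exclusions. As a rough illustration, the Python sketch below assembles a rule string from those three lists. The function names (quote, any_of, build_rule) are hypothetical, and the quoting heuristic (wrap any term containing a non-alphanumeric character in double quotes) is an assumption made for this example rather than a documented requirement.

```python
def quote(term: str) -> str:
    """Wrap a term in double quotes if it contains spaces or punctuation."""
    needs_quotes = any(not ch.isalnum() for ch in term)
    return f'"{term}"' if needs_quotes else term

def any_of(terms) -> str:
    """Build a parenthesized OR group."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"

def build_rule(required, any_of_groups, excluded) -> str:
    """Join required terms, OR groups, and negations with implicit AND."""
    parts = [quote(t) for t in required]
    parts += [any_of(group) for group in any_of_groups]
    parts += [f"-{quote(t)}" for t in excluded]  # each term negated on its own
    return " ".join(parts)

rule = build_rule(
    required=["tesla"],
    any_of_groups=[
        ["model 3", "model-3", "model three"],
        ["angry", "happy", "love"],
    ],
    excluded=["nikola", "edison", "coil", "coils"],
)
print(rule)
# tesla ("model 3" OR "model-3" OR "model three") (angry OR happy OR love) -nikola -edison -coil -coils
```

Keeping the requirements as lists and generating the string from them makes it easier to review a rule against the plain-language request before it is deployed.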
Now let’s look at an example that incorporates filtering on some of the unique aspects of social data, outside of just plain text matching.
Example 2
“Our company is interested in some references about the TV show ‘Cutthroat Kitchen’ on the Food Network, as well as references to the host (Alton Brown) that relate to the show.”
Using what we learned in the first example, we can create a rule to capture mentions like this.
"cutthroat kitchen" OR cutthroatkitchen OR (("alton brown" OR altonbrown) (host OR show))
This rule would capture:
- Tweets with the phrase “cutthroat kitchen” in the text.
- Tweets with "cutthroatkitchen" in the text (same as above, but without the space).
- Tweets mentioning "alton brown" (or "altonbrown" without the space), where that Tweet also mentions "host" or "show".
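Where the parentheses sit in the last clause decides what the rule matches. The Python sketch below compares two readings of that clause on a sample Tweet, assuming that terms joined by an implicit AND are combined before OR is applied. The helpers words and has are a toy stand-in for keyword and phrase matching, not Twitter's matching engine, and the sample text is made up.

```python
import re

def words(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def has(text, term):
    """True when the words of `term` appear consecutively in `text`."""
    haystack, needle = words(text), words(term)
    return any(haystack[i:i + len(needle)] == needle
               for i in range(len(haystack) - len(needle) + 1))

def without_inner_group(text):
    # ("alton brown" OR altonbrown) host OR show, with AND applied before OR
    name = has(text, "alton brown") or has(text, "altonbrown")
    return (name and has(text, "host")) or has(text, "show")

def with_inner_group(text):
    # ("alton brown" OR altonbrown) (host OR show)
    name = has(text, "alton brown") or has(text, "altonbrown")
    return name and (has(text, "host") or has(text, "show"))

unrelated = "Great show at the arena last night"
print(without_inner_group(unrelated))  # True: "show" alone is enough
print(with_inner_group(unrelated))     # False: the host's name is required
```

Under this reading, only the version that groups "host" and "show" together ties both keywords to the host's name.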
Now, let’s imagine that the company collects Tweets with this rule for a week, but their customer is unhappy with the quality of the results, and wants to target content more specifically. This time around, they want to narrow the results, but also want to capture some specific mentions they missed before.
“We only want Tweets using our promoted hashtag (#cutthroatkitchen), Tweets mentioning the show’s host by his Twitter handle (@altonbrown) in a way that relates to the show, or Tweets that link to the Food Network’s online page about Alton Brown. Additionally, we only want Tweets that come from users who say they are based in the United States.”
Let’s begin with the first requirement.
Our previous rule would have captured Tweets using the promoted hashtag, thanks to our "cutthroatkitchen" term. This would have matched mentions like "#cutthroatkitchen", "@cutthroatkitchen", or just a bare reference to "cutthroatkitchen". All of these match because keyword matching is tokenized, and punctuation is used to split the text into tokens. However, our customer wants to restrict this to only match on uses of "#cutthroatkitchen". To do this, we’ll use the # operator as follows:
#cutthroatkitchen
While the original "cutthroatkitchen" term looked for matches in the general text of the Tweet, this rule actually changes the strategy, and looks for a match in the list of hashtags Twitter has extracted from the Tweet itself (with the native enriched format, this will match on the twitter_entities.hashtags field). Thus, it provides a much more targeted way for the customer to ensure they are only getting hashtag mentions of the phrase.
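The difference between the two strategies can be sketched in a few lines of Python. This is a toy approximation only: keyword_match splits text on punctuation and whitespace to mimic a tokenized keyword match, and hashtag_match uses a regular expression as a stand-in for the hashtag list that Twitter extracts. Both function names and the sample Tweets are hypothetical.

```python
import re

def keyword_match(text, keyword):
    """Toy tokenized match: split on punctuation and whitespace, ignore case."""
    return keyword.lower() in re.findall(r"\w+", text.lower())

def hashtag_match(text, hashtag):
    """Toy stand-in for matching against the extracted hashtag list."""
    hashtags = re.findall(r"#(\w+)", text.lower())
    return hashtag.lstrip("#").lower() in hashtags

samples = [
    "Watching #cutthroatkitchen tonight",
    "Thanks @cutthroatkitchen!",
    "cutthroatkitchen is on again",
]
for text in samples:
    print(keyword_match(text, "cutthroatkitchen"),
          hashtag_match(text, "#cutthroatkitchen"))
# True True
# True False
# True False
```

The keyword matches all three samples, while the hashtag check matches only the first.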
A similar concept applies to restricting the mentions of Alton Brown to those using his Twitter handle. The previous term ("altonbrown") would have gotten all mentions in the text of that specific string. However, we can use the @ operator to restrict it to ONLY explicit references to his Twitter handle.
@altonbrown
This means the rule looks for a match in Twitter’s extracted user mentions (with the native enriched format, this will match on the twitter_entities.mentions field), rather than in the general text of the Tweet. Since we are now specifically referencing the Twitter account of the show's host, the verified @altonbrown account, we no longer need to require the "host" or "show" keywords.
Next, the customer was previously getting some Tweets linking to their web page about Alton Brown just by looking for mentions of his name within the text, which includes any URLs that appear in the text of the Tweet. However, they were missing some references where Twitter users shortened their URLs with services like bit.ly before posting them. To accommodate this need, we need to use the url_contains: operator with the specific URL the customer wants to track.
url_contains:"foodnetwork.com/chefs/alton-brown"
The url_contains: operator looks for matches in the fully expanded URL that is provided in the Tweet as an enrichment by Twitter. In other words, even if a URL is wrapped in a bit.ly or other shortened link, Twitter will unwind it down to the final URL and allow you to look for matches there.
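To see why matching on the expanded URL catches links that a text match misses, consider the Python sketch below. The payload is a simplified, hypothetical dictionary: the entities.urls layout, the expanded_url key, and the bit.ly/EXAMPLE placeholder link are assumptions for illustration, and the actual field names depend on the payload format you receive.

```python
def url_contains(tweet: dict, fragment: str) -> bool:
    """Toy check: does any expanded URL in the payload contain `fragment`?"""
    urls = tweet.get("entities", {}).get("urls", [])
    return any(fragment.lower() in u.get("expanded_url", "").lower()
               for u in urls)

# Hypothetical, simplified payload; the short link is a placeholder.
tweet = {
    "text": "Great profile of the host https://bit.ly/EXAMPLE",
    "entities": {
        "urls": [
            {
                "url": "https://bit.ly/EXAMPLE",
                "expanded_url": "https://www.foodnetwork.com/chefs/alton-brown",
            }
        ]
    },
}

fragment = "foodnetwork.com/chefs/alton-brown"
print(fragment in tweet["text"])      # False: the text only has the short link
print(url_contains(tweet, fragment))  # True: the expanded URL matches
```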
Last, the customer wants to restrict the results to Tweets where the user is from the United States. To do this, we will use Twitter’s Profile Geo enrichment and the corresponding premium operator to apply the restriction to all of the previously defined terms.
Please note that geo operators are not currently available in Twitter Developer Labs.
profile_country:us
Incorporating the changes above, we can come up with a rule that will satisfy the customer:
profile_country:us (#cutthroatkitchen OR @altonbrown OR url_contains:"foodnetwork.com/chefs/alton-brown")
This would match the following:
- Tweets using the company’s promoted hashtag, but not those using the keyword without the hashtag.
- Mentions of @altonbrown, but excluding plain-text mentions that don’t use @ mention syntax.
- Tweets that include links to the Food Network’s page about Alton Brown, even where they are shortened using bit.ly or another service.
- Additionally, no Tweets meeting the requirements above will be delivered unless they also have a profile country code for the United States, based on Twitter's Profile Geo enrichment.
The syntax used is important: the parentheses create the boolean logic we want, and ensure that the profile_country:us operator is applied across the board. When in doubt, use parentheses so that you don’t end up with unexpected results due to the order of operations for rules.
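One way to make the scoping explicit is to generate the rule from its parts. The Python sketch below wraps a list of alternatives in a single parenthesized OR group and prefixes the scoping operator. The helper name scoped_rule is hypothetical, and the comment on the unparenthesized form assumes that implicit AND is applied before OR.

```python
def scoped_rule(scope, alternatives):
    """Apply `scope` to every alternative by wrapping the OR group in parentheses."""
    return f"{scope} ({' OR '.join(alternatives)})"

alternatives = [
    "#cutthroatkitchen",
    "@altonbrown",
    'url_contains:"foodnetwork.com/chefs/alton-brown"',
]

print(scoped_rule("profile_country:us", alternatives))
# profile_country:us (#cutthroatkitchen OR @altonbrown OR url_contains:"foodnetwork.com/chefs/alton-brown")

# Without the parentheses, the country restriction would attach only to the
# first alternative, assuming AND is applied before OR:
#   profile_country:us #cutthroatkitchen OR @altonbrown OR url_contains:"..."
```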
Beyond these examples, there are hundreds of ways that you can combine operators and keywords to return the data that is critical to your analysis. You can expand on these concepts to narrow your search based on profile information, follower count, Tweet location, language used in the text, and many more attributes. In addition to the topics discussed here, you should be well-versed in the full documentation on operators and rules, including the limits on restricted characters and rule size. You can find these details in the following resources:
- Premium and enterprise operators and how to build queries
- Building queries for Twitter Developer Labs filtered stream and recent search