Parser - Regex to result | Voters

complete

Holger

I suggest adding a Parser Regex option that keeps the matching text as the result. I think this is the classic way of scraping :-)

"Keep any text that matches the regular expression as the result; discard everything else."

February 16, 2023

Cahyo marked this post as complete

Cahyo

Added a new parser, "Regex match keep", to keep content from an extraction. For example, keep only the first or all emails from a text paragraph.
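As a rough sketch of what a keep-the-match parser like this does (not MrScraper's actual implementation; the function name `regex_match_keep`, its `keep_all` flag, and the simplified email pattern are assumptions made up for illustration):

```python
import re

# Simplified email pattern, for illustration only; real addresses can be more complex.
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

def regex_match_keep(text, pattern, keep_all=False):
    """Keep only the text that matches `pattern`; discard everything else.

    Hypothetical helper: returns the first match (or None) by default,
    or a list of all matches when keep_all is True.
    """
    matches = [m.group(0) for m in re.finditer(pattern, text)]
    if keep_all:
        return matches
    return matches[0] if matches else None

paragraph = "Contact sales@example.com or support@example.org for details."
first_email = regex_match_keep(paragraph, EMAIL_PATTERN)
all_emails = regex_match_keep(paragraph, EMAIL_PATTERN, keep_all=True)
```

Here `first_email` holds only the first address and `all_emails` holds both; the rest of the paragraph is dropped.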

Cahyo

The idea of MrScraper is that with the extractors you already retrieve only the info you want, and with the parsers you can clean or modify each property.
Can you give an example scenario where you would select an element and then discard part of it later in the parsing phase? I may be able to add what you need.
Thanks

Holger

Yes, I need these two values:
"ratingValue": 7.6,
"ratingCount": 3,
from a page like this:
<!DOCTYPE html>
<html lang="......ttp://www.w3.org/1999/xhtml">  
<head id="head">
<title> ....
<link rel= ...
<link rel= ...
<link rel= ...
<meta name=
<meta name=
<script type="application/ld+json">
[ {
"@context": "https://www.schema.org",
"@type": "Product",
....
"aggregateRating": {
"@type": "AggregateRating",
"ratingValue": 7.6,
"bestRating": 10,
"ratingCount": 3,
},
...}]
</script>
</head>
<body 
My idea was:
select the head tag with an XPath selector,
maybe the head->script tag (but with no specific markers this is not stable enough if further script tags are added),
grep the full "aggregateRating": { } JSON part with a regex parser,
return this as the result and extract the needed values in my script,
or extract the values with two regex parsers and return only the numeric values.
I don't know whether I could extract the values with crazy XPath expressions;
I thought the parsers are supposed to do the fine-tuning on the result.
Thanks for the help :-)
(sorry for the formatting, tried hard, but the form added unintended newlines)
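The regex idea above could be sketched roughly like this in Python. This only illustrates the matching logic, not MrScraper's parser API; the sample input is a shortened copy of the snippet above, and the helper name `keep_number` is made up. Regex is used instead of a JSON parser because the excerpt as posted has a trailing comma and would not parse as strict JSON.

```python
import re

# Sample input modeled on the snippet above; the surrounding tags are shortened.
html = '''<head id="head">
<script type="application/ld+json">
[ {
"@context": "https://www.schema.org",
"@type": "Product",
"aggregateRating": {
"@type": "AggregateRating",
"ratingValue": 7.6,
"bestRating": 10,
"ratingCount": 3,
},
}]
</script>
</head>'''

# Step 1: keep only the "aggregateRating": { ... } block.
# Assumption: the block contains no nested braces, so the non-greedy
# match can stop at the first closing brace.
block_match = re.search(r'"aggregateRating"\s*:\s*\{.*?\}', html, re.DOTALL)
block = block_match.group(0) if block_match else ""

def keep_number(name, text):
    """Hypothetical helper: return the numeric value of a "name": number pair."""
    m = re.search(r'"%s"\s*:\s*(\d+(?:\.\d+)?)' % re.escape(name), text)
    return float(m.group(1)) if m else None

# Step 2: two separate "keep the match" parsers, one per value.
rating_value = keep_number("ratingValue", block)
rating_count = keep_number("ratingCount", block)
```

Matching on the "aggregateRating" key rather than on the position of the script tag means the result should not depend on how many other script tags the head contains.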