Disclaimer
You really should use an HTML parsing engine, since there are many obscure edge cases that a regular expression cannot easily accommodate. But I'm not your mom, so I'm not going to tell you how to live your life.
Since it looks like you have some control over the source text, you can probably avoid the obscure edge cases entirely.
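For comparison, here is a minimal sketch of the parser route. It assumes Python and its standard-library html.parser module, which is my choice for illustration and not something from the question; the class name UserActionsParser and the dictionary keys are made up. The sample tag is the excerpt from the source text.

```python
from html.parser import HTMLParser


class UserActionsParser(HTMLParser):
    """Collect data-* values from <div> tags whose class list has user-actions."""

    def __init__(self):
        super().__init__()
        self.users = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips the quotes.
        if tag != "div":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "user-actions" not in classes:
            return
        self.users.append({
            "screen_name": attributes.get("data-screen-name"),
            "name": attributes.get("data-name"),
            "protected": attributes.get("data-protected"),
        })


sample = '''<div class="user-actions btn-group not-following not-muting protected" data-user-id="726459723365502976"
data-screen-name="Just__Kidding__" data-name="Chaw Chin Fong" data-protected="true">'''

parser = UserActionsParser()
parser.feed(sample)
parser.close()
print(parser.users)
```

Note that this sketch checks for user-actions anywhere in the class list, which is slightly more permissive than the regex below.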
Description
This regular expression will do the following:
- find all div tags
- require that each div tag found contains class="user-actions (the class value must start with user-actions), with or without the quote
- capture the values of data-screen-name, data-name, and data-protected into their own capture groups
- allow the attribute/value pairs to appear in any order
- allow the quotes around a value to be optional, so you can use single quotes, double quotes, or no quotes
- scrub the quotes from the values, so you get the raw value
- avoid many of the messy edge cases the regex police cry about when matching HTML
The Regex
<div\b(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?class=['"]?user-actions)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-screen-name=(['"]?)(.*?)\1(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-name=(['"]?)(.*?)\3(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-protected=(['"]?)(.*?)\5(?:\s|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>
I recommend using the case-insensitive flag.
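As a rough sketch of how the expression might be applied when the input holds several matching tags, the snippet below compiles it with the case-insensitive flag and walks over every match. Python is an assumption on my part (the thread does not show the asker's code), and the names USER_DIV, html, and users are made up for illustration. The two tags are copied from the sample matches.

```python
import re

# The expression from above, unchanged. A triple-quoted raw string lets it
# contain both quote characters without extra escaping.
USER_DIV = re.compile(
    r'''<div\b(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?class=['"]?user-actions)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-screen-name=(['"]?)(.*?)\1(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-name=(['"]?)(.*?)\3(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-protected=(['"]?)(.*?)\5(?:\s|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>''',
    re.IGNORECASE,
)

html = '''<div class="user-actions btn-group not-following not-muting " data-user-id="2582252852"
data-screen-name="w33haa" data-name="Aliwi Omar" data-protected="false">
<div class="user-actions btn-group not-following not-muting protected" data-user-id="726459723365502976"
data-screen-name="Just__Kidding__" data-name="Chaw Chin Fong" data-protected="true">'''

# finditer yields one match object per tag; groups 2, 4 and 6 hold the values.
users = [(m.group(2), m.group(4), m.group(6)) for m in USER_DIV.finditer(html)]
for screen_name, name, protected in users:
    print(screen_name, name, protected)
```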
Examples
Source Text Excerpt
A tiny portion of your sample text
<div class="user-actions btn-group not-following not-muting protected" data-user-id="726459723365502976"
data-screen-name="Just__Kidding__" data-name="Chaw Chin Fong" data-protected="true">
Live Example
Shows only a larger portion of your full file, because the online tool chokes on huge strings of text.
https://regex101.com/r/bY1kH8/1
Capture Groups
- Group 0 gets the entire opening div tag
- Group 1 gets the quote, if there was one, around the data-screen-name value
- Group 2 gets the data-screen-name value, not including the quotes
- Group 3 gets the quote, if there was one, around the data-name value
- Group 4 gets the data-name value, not including the quotes
- Group 5 gets the quote, if there was one, around the data-protected value
- Group 6 gets the data-protected value, not including any quotes
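To make the group layout concrete, here is a small sketch, again assuming Python. The tag is a hypothetical one I made up: it reorders the attributes and uses one quoting style per value, so the odd-numbered groups show the double quote, the single quote, and an empty string respectively.

```python
import re

USER_DIV = re.compile(
    r'''<div\b(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?class=['"]?user-actions)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-screen-name=(['"]?)(.*?)\1(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-name=(['"]?)(.*?)\3(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-protected=(['"]?)(.*?)\5(?:\s|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>''',
    re.IGNORECASE,
)

# Hypothetical tag: attributes in a different order, mixed quoting styles.
tag = '''<div data-protected=false data-name='Derpy' class=user-actions data-screen-name="Ninderpy">'''

m = USER_DIV.search(tag)
groups = m.groups()          # groups 1..6 as a tuple
print(m.group(0) == tag)     # group 0 is the whole opening tag
print(groups)
```

The group numbers follow the order of the look-aheads in the expression, not the order of the attributes in the tag.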
Sample Matches
These were taken from your source text using the proposed regular expression.
[0][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="2582252852"
data-screen-name="w33haa" data-name="Aliwi Omar" data-protected="false">
[0][1] = "
[0][2] = w33haa
[0][3] = "
[0][4] = Aliwi Omar
[0][5] = "
[0][6] = false
[1][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="1680222842"
data-screen-name="Jamjomon" data-name="Jamchu :3" data-protected="false">
[1][1] = "
[1][2] = Jamjomon
[1][3] = "
[1][4] = Jamchu :3
[1][5] = "
[1][6] = false
[2][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="1523823648"
data-screen-name="dimakoza4enko" data-name="Дима Козаченко" data-protected="false">
[2][1] = "
[2][2] = dimakoza4enko
[2][3] = "
[2][4] = Дима Козаченко
[2][5] = "
[2][6] = false
[3][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="1522238240"
data-screen-name="alupulipulipala" data-name="Wahid Arefin" data-protected="false">
[3][1] = "
[3][2] = alupulipulipala
[3][3] = "
[3][4] = Wahid Arefin
[3][5] = "
[3][6] = false
[4][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="4804204573"
data-screen-name="thanhbach195" data-name="Mai Thanh Bách" data-protected="false">
[4][1] = "
[4][2] = thanhbach195
[4][3] = "
[4][4] = Mai Thanh Bách
[4][5] = "
[4][6] = false
[5][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="726465523223908353"
data-screen-name="zeref980" data-name="Yan Naung Htet" data-protected="false">
[5][1] = "
[5][2] = zeref980
[5][3] = "
[5][4] = Yan Naung Htet
[5][5] = "
[5][6] = false
[6][0] = <div class="user-actions btn-group not-following not-muting protected" data-user-id="726459723365502976"
data-screen-name="Just__Kidding__" data-name="Chaw Chin Fong" data-protected="true">
[6][1] = "
[6][2] = Just__Kidding__
[6][3] = "
[6][4] = Chaw Chin Fong
[6][5] = "
[6][6] = true
[7][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="713605605638938624"
data-screen-name="Fruitcentre" data-name="Fruit & Veg Centre" data-protected="false">
[7][1] = "
[7][2] = Fruitcentre
[7][3] = "
[7][4] = Fruit & Veg Centre
[7][5] = "
[7][6] = false
[8][0] = <div class="user-actions btn-group not-following not-muting protected" data-user-id="555968644"
data-screen-name="aeronhalecastle" data-name="Eywon ツ" data-protected="true">
[8][1] = "
[8][2] = aeronhalecastle
[8][3] = "
[8][4] = Eywon ツ
[8][5] = "
[8][6] = true
[9][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="2845398050"
data-screen-name="Deheyb" data-name="4k Scrub✌️" data-protected="false">
[9][1] = "
[9][2] = Deheyb
[9][3] = "
[9][4] = 4k Scrub✌️
[9][5] = "
[9][6] = false
[10][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="721815663216566272"
data-screen-name="Ribbon2712" data-name="Даниил Демидов" data-protected="false">
[10][1] = "
[10][2] = Ribbon2712
[10][3] = "
[10][4] = Даниил Демидов
[10][5] = "
[10][6] = false
[11][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="3248438456"
data-screen-name="zayarmgmg95" data-name="Zayar Mg" data-protected="false">
[11][1] = "
[11][2] = zayarmgmg95
[11][3] = "
[11][4] = Zayar Mg
[11][5] = "
[11][6] = false
[12][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="726440286063198208"
data-screen-name="Ninderpy" data-name="Derpy" data-protected="false">
[12][1] = "
[12][2] = Ninderpy
[12][3] = "
[12][4] = Derpy
[12][5] = "
[12][6] = false
[13][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="423763655"
data-screen-name="ImJoehuff" data-name="JoeyT" data-protected="false">
[13][1] = "
[13][2] = ImJoehuff
[13][3] = "
[13][4] = JoeyT
[13][5] = "
[13][6] = false
[14][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="726441786839703556"
data-screen-name="zxmir_" data-name="Zxmir_" data-protected="false">
[14][1] = "
[14][2] = zxmir_
[14][3] = "
[14][4] = Zxmir_
[14][5] = "
[14][6] = false
[15][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="726440845713367041"
data-screen-name="hienlequang" data-name="Hiền Lê Quang" data-protected="false">
[15][1] = "
[15][2] = hienlequang
[15][3] = "
[15][4] = Hiền Lê Quang
[15][5] = "
[15][6] = false
[16][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="3032113115"
data-screen-name="Najer14" data-name="Jan" data-protected="false">
[16][1] = "
[16][2] = Najer14
[16][3] = "
[16][4] = Jan
[16][5] = "
[16][6] = false
[17][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="4762819022"
data-screen-name="7forOne" data-name="Abiel" data-protected="false">
[17][1] = "
[17][2] = 7forOne
[17][3] = "
[17][4] = Abiel
[17][5] = "
[17][6] = false
[18][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="717061680799330306"
data-screen-name="Th3uN1qu31" data-name="Th3_uN1Qu3" data-protected="false">
[18][1] = "
[18][2] = Th3uN1qu31
[18][3] = "
[18][4] = Th3_uN1Qu3
[18][5] = "
[18][6] = false
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
<div '<div'
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
class= 'class='
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
user-actions 'user-actions'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
data-screen-name= 'data-screen-name='
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
.*? any character (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
data-name= 'data-name='
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
( group and capture to \4:
----------------------------------------------------------------------
.*? any character (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \4
----------------------------------------------------------------------
\3 what was matched by capture \3
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
data-protected= 'data-protected='
----------------------------------------------------------------------
( group and capture to \5:
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of \5
----------------------------------------------------------------------
( group and capture to \6:
----------------------------------------------------------------------
.*? any character (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \6
----------------------------------------------------------------------
\5 what was matched by capture \5
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
[^'] any character except: '''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
\\ '\'
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
[^"] any character except: '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
\\ '\'
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
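The breakdown above repeats the same attribute-skipping fragment four times, so one way to read the expression is as a handful of reusable pieces. The sketch below rebuilds it that way in Python; the names SKIP, TAIL, capture_attr, and PATTERN are my own and purely illustrative, and the language is an assumption.

```python
import re

# Lazily step over attribute names and complete attribute values without
# ever crossing the tag's closing '>'.
SKIP = r'''(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?'''


def capture_attr(name, quote_group):
    """Look-ahead capturing one attribute's optional quote and its value."""
    return (
        "(?=" + SKIP + name
        + r'''=(['"]?)(.*?)'''            # quote group, then the value
        + "\\" + str(quote_group)         # backreference to the same quote
        + r"(?:\s|>))"
    )


# Greedy run that consumes the rest of the tag up to its closing '>'.
TAIL = r'''(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>'''

PATTERN = (
    r"<div\b"
    + "(?=" + SKIP + r'''class=['"]?user-actions)'''
    + capture_attr("data-screen-name", 1)
    + capture_attr("data-name", 3)
    + capture_attr("data-protected", 5)
    + TAIL
)

sample = '''<div class="user-actions btn-group not-following not-muting " data-user-id="423763655"
data-screen-name="ImJoehuff" data-name="JoeyT" data-protected="false">'''

m = re.search(PATTERN, sample, re.IGNORECASE)
print(m.group(2), m.group(4), m.group(6))
```

Because every look-ahead starts from the same position right after <div, each one scans the tag independently, which is what lets the attributes appear in any order.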
You should use the Html Agility Pack for this. But you already know that. –
But it works perfectly. (http://i.stack.imgur.com/PN5bJ.png). How are you accessing the group value? Are there multiple matches in your html? –
Yes @MaximilianGerhardt, there is more than one match, which is my problem. – user6274399