<!doctypehtml><html class="sidebar-visible no-js light"lang=en><head><meta charset=UTF-8><title>Exercise Solutions - Understanding Python re(gex)?</title><meta content="text/html; charset=utf-8"http-equiv=Content-Type><meta content="Example based guide to mastering Python regular expressions"name=description><meta content=width=device-width,initial-scale=1 name=viewport><meta content=#ffffff name=theme-color><meta content="Understanding Python re(gex)?"property=og:title><meta content=website property=og:type><meta content="Example based guide to mastering Python regular expressions"property=og:description><meta content=https://learnbyexample.github.io/py_regular_expressions/ property=og:url><meta content=https://raw.githubusercontent.com/learnbyexample/py_regular_expressions/master/images/py_regex_ls.png property=og:image><meta content=1280 property=og:image:width><meta content=720 property=og:image:height><meta content=summary_large_image property=twitter:card><meta content=@learn_byexample property=twitter:site><link href=favicon.svg rel=icon><link rel="shortcut icon"href=favicon.png><link href=css/variables.css rel=stylesheet><link href=css/general.css rel=stylesheet><link href=css/chrome.css rel=stylesheet><link href=FontAwesome/css/font-awesome.css rel=stylesheet><link href=fonts/fonts.css rel=stylesheet><link href=highlight.css rel=stylesheet><link href=tomorrow-night.css rel=stylesheet><link href=ayu-highlight.css rel=stylesheet><link href=style.css rel=stylesheet><body><script>var path_to_root = "";
var default_theme = window.matchMedia("(prefers-color-scheme: dark)").matches ? "navy" : "light";</script><script>try {
var theme = localStorage.getItem('mdbook-theme');
var sidebar = localStorage.getItem('mdbook-sidebar');
if (theme.startsWith('"') && theme.endsWith('"')) {
localStorage.setItem('mdbook-theme', theme.slice(1, theme.length - 1));
}
if (sidebar.startsWith('"') && sidebar.endsWith('"')) {
localStorage.setItem('mdbook-sidebar', sidebar.slice(1, sidebar.length - 1));
}
} catch (e) { }</script><script>var theme;
try { theme = localStorage.getItem('mdbook-theme'); } catch(e) { }
if (theme === null || theme === undefined) { theme = default_theme; }
var html = document.querySelector('html');
html.classList.remove('no-js')
html.classList.remove('light')
html.classList.add(theme);
html.classList.add('js');</script><script>var html = document.querySelector('html');
var sidebar = 'hidden';
if (document.body.clientWidth >= 1080) {
try { sidebar = localStorage.getItem('mdbook-sidebar'); } catch(e) { }
sidebar = sidebar || 'visible';
}
html.classList.remove('sidebar-visible');
html.classList.add("sidebar-" + sidebar);</script><nav aria-label="Table of contents"class=sidebar id=sidebar><div class=sidebar-scrollbox><ol class=chapter><li class="chapter-item expanded affix"><a href=cover.html>Cover</a><li class="chapter-item expanded affix"><a href=buy.html>Buy PDF/EPUB versions</a><li class="chapter-item expanded"><a href=preface.html><strong aria-hidden=true>1.</strong> Preface</a><li class="chapter-item expanded"><a href=why-is-it-needed.html><strong aria-hidden=true>2.</strong> Why is it needed?</a><li class="chapter-item expanded"><a href=re-introduction.html><strong aria-hidden=true>3.</strong> re introduction</a><li class="chapter-item expanded"><a href=anchors.html><strong aria-hidden=true>4.</strong> Anchors</a><li class="chapter-item expanded"><a href=alternation-and-grouping.html><strong aria-hidden=true>5.</strong> Alternation and Grouping</a><li class="chapter-item expanded"><a href=escaping-metacharacters.html><strong aria-hidden=true>6.</strong> Escaping metacharacters</a><li class="chapter-item expanded"><a href=dot-metacharacter-and-quantifiers.html><strong aria-hidden=true>7.</strong> Dot metacharacter and Quantifiers</a><li class="chapter-item expanded"><a href=interlude-tools-for-debugging-and-visualization.html><strong aria-hidden=true>8.</strong> Interlude: Tools for debugging and visualization</a><li class="chapter-item expanded"><a href=working-with-matched-portions.html><strong aria-hidden=true>9.</strong> Working with matched portions</a><li class="chapter-item expanded"><a href=character-class.html><strong aria-hidden=true>10.</strong> Character class</a><li class="chapter-item expanded"><a href=groupings-and-backreferences.html><strong aria-hidden=true>11.</strong> Groupings and backreferences</a><li class="chapter-item expanded"><a href=interlude-common-tasks.html><strong aria-hidden=true>12.</strong> Interlude: Common tasks</a><li class="chapter-item expanded"><a href=lookarounds.html><strong 
aria-hidden=true>13.</strong> Lookarounds</a><li class="chapter-item expanded"><a href=flags.html><strong aria-hidden=true>14.</strong> Flags</a><li class="chapter-item expanded"><a href=unicode.html><strong aria-hidden=true>15.</strong> Unicode</a><li class="chapter-item expanded"><a href=regex-module.html><strong aria-hidden=true>16.</strong> regex module</a><li class="chapter-item expanded"><a href=gotchas.html><strong aria-hidden=true>17.</strong> Gotchas</a><li class="chapter-item expanded"><a href=further-reading.html><strong aria-hidden=true>18.</strong> Further Reading</a><li class="chapter-item expanded"><a class=active href=Exercise_solutions.html><strong aria-hidden=true>19.</strong> Exercise Solutions</a></li><br><hr><li class="chapter-item expanded"><i class="fa fa-github"id=git-repository-button></i><a href=https://github.com/learnbyexample/py_regular_expressions> Source code</a><li class="chapter-item expanded"><i class="fa fa-home"id=home-button></i><a href=https://learnbyexample.github.io/> My Blog</a><li class="chapter-item expanded"><i class="fa fa-book"id=book-button></i><a href=https://learnbyexample.github.io/books/> My Books</a><li class="chapter-item expanded"><i class="fa fa-envelope"id=mail-button></i><a href=https://learnbyexample.gumroad.com/l/learnbyexample-weekly> learnbyexample weekly</a><li class="chapter-item expanded"><i class="fa fa-twitter"id=twitter-button></i><a href=https://twitter.com/learn_byexample> Twitter</a></ol></div><div class=sidebar-resize-handle id=sidebar-resize-handle></div></nav><div class=page-wrapper id=page-wrapper><div class=page><div id=menu-bar-hover-placeholder></div><div class="menu-bar sticky bordered"id=menu-bar><div class=left-buttons><button aria-label="Toggle Table of Contents"title="Toggle Table of Contents"aria-controls=sidebar class=icon-button id=sidebar-toggle type=button><i class="fa fa-bars"></i></button><button aria-label="Change theme"title="Change theme"aria-controls=theme-list 
aria-expanded=false aria-haspopup=true class=icon-button id=theme-toggle type=button><i class="fa fa-paint-brush"></i></button><ul aria-label=Themes class=theme-popup id=theme-list role=menu><li role=none><button class=theme id=light role=menuitem>Light (default)</button><li role=none><button class=theme id=rust role=menuitem>Rust</button><li role=none><button class=theme id=coal role=menuitem>Coal</button><li role=none><button class=theme id=navy role=menuitem>Navy</button><li role=none><button class=theme id=ayu role=menuitem>Ayu</button></ul><button aria-label="Toggle Searchbar"title="Search. (Shortkey: s)"aria-controls=searchbar aria-expanded=false aria-keyshortcuts=S class=icon-button id=search-toggle type=button><i class="fa fa-search"></i></button></div><h1 class=menu-title>Understanding Python re(gex)?</h1><div class=right-buttons><a aria-label=Blog href=https://learnbyexample.github.io title=Blog> <i class="fa fa-home"id=home-button></i> </a><a aria-label=Twitter href=https://twitter.com/learn_byexample title=Twitter> <i class="fa fa-twitter"id=twitter-button></i> </a><a aria-label="Git repository"title="Git repository"href=https://github.com/learnbyexample/py_regular_expressions> <i class="fa fa-github"id=git-repository-button></i> </a></div></div><div class=hidden id=search-wrapper><form class=searchbar-outer id=searchbar-outer><input placeholder="Search this book ..."aria-controls=searchresults-outer aria-describedby=searchresults-header id=searchbar name=searchbar type=search></form><div class="searchresults-outer hidden"id=searchresults-outer><div class=searchresults-header id=searchresults-header></div><ul id=searchresults></ul></div></div><script>document.getElementById('sidebar-toggle').setAttribute('aria-expanded', sidebar === 'visible');
document.getElementById('sidebar').setAttribute('aria-hidden', sidebar !== 'visible');
Array.from(document.querySelectorAll('#sidebar a')).forEach(function(link) {
link.setAttribute('tabIndex', sidebar === 'visible' ? 0 : -1);
});</script><div class=content id=content><main><div class=sidetoc><nav class=pagetoc></nav></div><h1 id=exercise-solutions><a class=header href=#exercise-solutions>Exercise solutions</a></h1><blockquote><p><img alt=info src=images/info.svg> Try to solve exercises in every chapter using only the features discussed until that chapter. Some of the exercises will be easier to solve with techniques presented in later chapters, but the aim of these exercises is to explore the features presented so far.</blockquote><br><h1 id=re-introduction><a class=header href=#re-introduction>re introduction</a></h1><p><strong>1)</strong> Check whether the given strings contain <code>0xB0</code>. Display a boolean result as shown below.<pre><code class=language-python>>>> line1 = 'start address: 0xA0, func1 address: 0xC0'
>>> line2 = 'end address: 0xFF, func2 address: 0xB0'
>>> import re
>>> bool(re.search(r'0xB0', line1))
False
>>> bool(re.search(r'0xB0', line2))
True
</code></pre><p><strong>2)</strong> Replace all occurrences of <code>5</code> with <code>five</code> for the given string.<pre><code class=language-python>>>> ip = 'They ate 5 apples and 5 oranges'
>>> re.sub(r'5', 'five', ip)
'They ate five apples and five oranges'
</code></pre><p><strong>3)</strong> Replace only the first occurrence of <code>5</code> with <code>five</code> for the given string.<pre><code class=language-python>>>> ip = 'They ate 5 apples and 5 oranges'
>>> re.sub(r'5', 'five', ip, count=1)
'They ate five apples and 5 oranges'
</code></pre><p><strong>4)</strong> For the given list, filter all elements that do <em>not</em> contain <code>e</code>.<pre><code class=language-python>>>> items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']
>>> [w for w in items if not re.search(r'e', w)]
['goal', 'sit']
</code></pre><p><strong>5)</strong> Replace all occurrences of <code>note</code> irrespective of case with <code>X</code>.<pre><code class=language-python>>>> ip = 'This note should not be NoTeD'
>>> re.sub(r'note', 'X', ip, flags=re.I)
'This X should not be XD'
</code></pre><p><strong>6)</strong> Check if <code>at</code> is present in the given byte input data.<pre><code class=language-python>>>> ip = b'tiger imp goat'
>>> bool(re.search(rb'at', ip))
True
</code></pre><p><strong>7)</strong> For the given input string, display all lines not containing <code>start</code> irrespective of case.<pre><code class=language-python>>>> para = '''good start
... Start working on that
... project you always wanted
... stars are shining brightly
... hi there
... start and try to
... finish the book
... bye'''
>>> pat = re.compile(r'start', flags=re.I)
>>> for line in para.split('\n'):
... if not pat.search(line):
... print(line)
...
project you always wanted
stars are shining brightly
hi there
finish the book
bye
</code></pre><p><strong>8)</strong> For the given list, filter all elements that contain either <code>a</code> or <code>w</code>.<pre><code class=language-python>>>> items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']
>>> [w for w in items if re.search(r'a', w) or re.search(r'w', w)]
['goal', 'new', 'eat']
</code></pre><p><strong>9)</strong> For the given list, filter all elements that contain both <code>e</code> and <code>n</code>.<pre><code class=language-python>>>> items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']
>>> [w for w in items if re.search(r'e', w) and re.search(r'n', w)]
['new', 'dinner']
</code></pre><p><strong>10)</strong> For the given string, replace <code>0xA0</code> with <code>0x7F</code> and <code>0xC0</code> with <code>0x1F</code>.<pre><code class=language-python>>>> ip = 'start address: 0xA0, func1 address: 0xC0'
>>> re.sub(r'0xC0', '0x1F', re.sub(r'0xA0', '0x7F', ip))
'start address: 0x7F, func1 address: 0x1F'
</code></pre><br><h1 id=anchors><a class=header href=#anchors>Anchors</a></h1><p><strong>1)</strong> Check if the given strings start with <code>be</code>.<pre><code class=language-python>>>> line1 = 'be nice'
>>> line2 = '"best!"'
>>> line3 = 'better?'
>>> line4 = 'oh no\nbear spotted'
>>> pat = re.compile(r'\Abe')
>>> bool(pat.search(line1))
True
>>> bool(pat.search(line2))
False
>>> bool(pat.search(line3))
True
>>> bool(pat.search(line4))
False
</code></pre><p><strong>2)</strong> For the given input string, change only the whole word <code>red</code> to <code>brown</code>.<pre><code class=language-python>>>> words = 'bred red spread credible red.'
>>> re.sub(r'\bred\b', 'brown', words)
'bred brown spread credible brown.'
</code></pre><p><strong>3)</strong> For the given input list, filter all elements that contain <code>42</code> surrounded by word characters.<pre><code class=language-python>>>> words = ['hi42bye', 'nice1423', 'bad42', 'cool_42a', '42fake', '_42_']
>>> [w for w in words if re.search(r'\B42\B', w)]
['hi42bye', 'nice1423', 'cool_42a', '_42_']
</code></pre><p><strong>4)</strong> For the given input list, filter all elements that start with <code>den</code> or end with <code>ly</code>.<pre><code class=language-python>>>> items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\n', 'dent']
>>> [e for e in items if re.search(r'\Aden', e) or re.search(r'ly\Z', e)]
['lovely', '2 lonely', 'dent']
</code></pre><p><strong>5)</strong> For the given input string, change whole word <code>mall</code> to <code>1234</code> only if it is at the start of a line.<pre><code class=language-python>>>> para = '''\
... (mall) call ball pall
... ball fall wall tall
... mall call ball pall
... wall mall ball fall
... mallet wallet malls
... mall:call:ball:pall'''
>>> print(re.sub(r'^mall\b', '1234', para, flags=re.M))
(mall) call ball pall
ball fall wall tall
1234 call ball pall
wall mall ball fall
mallet wallet malls
1234:call:ball:pall
</code></pre><p><strong>6)</strong> For the given list, filter all elements having a line starting with <code>den</code> or ending with <code>ly</code>.<pre><code class=language-python>>>> items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\nfar', 'dent']
>>> [e for e in items if re.search(r'^den', e, flags=re.M) or re.search(r'ly$', e, flags=re.M)]
['lovely', '1\ndentist', '2 lonely', 'fly\nfar', 'dent']
</code></pre><p><strong>7)</strong> For the given input list, filter all whole elements <code>12\nthree</code> irrespective of case.<pre><code class=language-python>>>> items = ['12\nthree\n', '12\nThree', '12\nthree\n4', '12\nthree']
>>> [e for e in items if re.fullmatch(r'12\nthree', e, flags=re.I)]
['12\nThree', '12\nthree']
</code></pre><p><strong>8)</strong> For the given input list, replace <code>hand</code> with <code>X</code> for all elements that start with <code>hand</code> followed by at least one word character.<pre><code class=language-python>>>> items = ['handed', 'hand', 'handy', 'un-handed', 'handle', 'hand-2']
>>> [re.sub(r'\Ahand\B', 'X', w) for w in items]
['Xed', 'hand', 'Xy', 'un-handed', 'Xle', 'hand-2']
</code></pre><p><strong>9)</strong> For the given input list, filter all elements starting with <code>h</code>. Additionally, replace <code>e</code> with <code>X</code> for these filtered elements.<pre><code class=language-python>>>> items = ['handed', 'hand', 'handy', 'unhanded', 'handle', 'hand-2']
>>> [re.sub(r'e', 'X', w) for w in items if re.search(r'\Ah', w)]
['handXd', 'hand', 'handy', 'handlX', 'hand-2']
</code></pre><br><h1 id=alternation-and-grouping><a class=header href=#alternation-and-grouping>Alternation and Grouping</a></h1><p><strong>1)</strong> For the given list, filter all elements that start with <code>den</code> or end with <code>ly</code>.<pre><code class=language-python>>>> items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\n', 'dent']
>>> [e for e in items if re.search(r'\Aden|ly\Z', e)]
['lovely', '2 lonely', 'dent']
</code></pre><p><strong>2)</strong> For the given list, filter all elements having a line starting with <code>den</code> or ending with <code>ly</code>.<pre><code class=language-python>>>> items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\nfar', 'dent']
>>> [e for e in items if re.search(r'^den|ly$', e, flags=re.M)]
['lovely', '1\ndentist', '2 lonely', 'fly\nfar', 'dent']
</code></pre><p><strong>3)</strong> For the given strings, replace all occurrences of <code>removed</code> or <code>reed</code> or <code>received</code> or <code>refused</code> with <code>X</code>.<pre><code class=language-python>>>> s1 = 'creed refuse removed read'
>>> s2 = 'refused reed redo received'
>>> pat = re.compile(r're(mov|ceiv|fus|)ed')
>>> pat.sub('X', s1)
'cX refuse X read'
>>> pat.sub('X', s2)
'X X redo X'
</code></pre><p><strong>4)</strong> For the given strings, replace all matches from the list <code>words</code> with <code>A</code>.<pre><code class=language-python>>>> s1 = 'plate full of slate'
>>> s2 = "slated for later, don't be late"
>>> words = ['late', 'later', 'slated']
>>> pat = re.compile('|'.join(sorted(words, key=len, reverse=True)))
>>> pat.sub('A', s1)
'pA full of sA'
>>> pat.sub('A', s2)
"A for A, don't be A"
</code></pre><p><strong>5)</strong> Filter all whole elements from the input list <code>items</code> based on elements listed in <code>words</code>.<pre><code class=language-python>>>> items = ['slate', 'later', 'plate', 'late', 'slates', 'slated ']
>>> words = ['late', 'later', 'slated']
>>> pat = re.compile('|'.join(words))
>>> [w for w in items if pat.fullmatch(w)]
['later', 'late']
</code></pre><br><h1 id=escaping-metacharacters><a class=header href=#escaping-metacharacters>Escaping metacharacters</a></h1><p><strong>1)</strong> Transform the given input strings to the expected output using the same logic on both strings.<pre><code class=language-python>>>> str1 = '(9-2)*5+qty/3-(9-2)*7'
>>> str2 = '(qty+4)/2-(9-2)*5+pq/4'
# easiest solution
>>> str1.replace('(9-2)*5', '35')
'35+qty/3-(9-2)*7'
>>> str2.replace('(9-2)*5', '35')
'(qty+4)/2-35+pq/4'
# if you must do it with 're' module
>>> re.sub(r'\(9-2\)\*5', '35', str1)
'35+qty/3-(9-2)*7'
>>> re.sub(r'\(9-2\)\*5', '35', str2)
'(qty+4)/2-35+pq/4'
</code></pre><p><strong>2)</strong> Replace <code>(4)\|</code> with <code>2</code> only at the start or end of the given input strings.<pre><code class=language-python>>>> s1 = r'2.3/(4)\|6 foo 5.3-(4)\|'
>>> s2 = r'(4)\|42 - (4)\|3'
>>> s3 = 'two - (4)\\|\n'
>>> pat = re.compile(r'\A\(4\)\\\||\(4\)\\\|\Z')
>>> pat.sub('2', s1)
'2.3/(4)\\|6 foo 5.3-2'
>>> pat.sub('2', s2)
'242 - (4)\\|3'
>>> pat.sub('2', s3)
'two - (4)\\|\n'
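# Alternative sketch (not part of the original solution): let re.escape()
# produce the escaped literal instead of typing the backslashes by hand.
# The names 'literal' and 'escaped_pat' are made up for this illustration.
import re
literal = re.escape(r'(4)\|')
escaped_pat = re.compile(rf'\A{literal}|{literal}\Z')
print(literal)  # \(4\)\\\|
print(escaped_pat.sub('2', r'2.3/(4)\|6 foo 5.3-(4)\|'))  # 2.3/(4)\|6 foo 5.3-2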
</code></pre><p><strong>3)</strong> Replace any matching element from the list <code>items</code> with <code>X</code> for the given input strings. Match the elements from <code>items</code> literally. Assume no two elements of <code>items</code> will result in any matching conflict.<pre><code class=language-python>>>> items = ['a.b', '3+n', r'x\y\z', 'qty||price', '{n}']
>>> pat = re.compile('|'.join(re.escape(e) for e in items))
>>> pat.sub('X', '0a.bcd')
'0Xcd'
>>> pat.sub('X', 'E{n}AMPLE')
'EXAMPLE'
>>> pat.sub('X', r'43+n2 ax\y\ze')
'4X2 aXe'
</code></pre><p><strong>4)</strong> Replace the backspace character <code>\b</code> with a single space character for the given input string.<pre><code class=language-python>>>> ip = '123\b456'
>>> ip
'123\x08456'
>>> print(ip)
12456
>>> re.sub(r'\x08', ' ', ip)
'123 456'
</code></pre><p><strong>5)</strong> Replace all occurrences of <code>\e</code> with <code>e</code>.<pre><code class=language-python>>>> ip = r'th\er\e ar\e common asp\ects among th\e alt\ernations'
>>> re.sub(r'\\e', 'e', ip)
'there are common aspects among the alternations'
</code></pre><p><strong>6)</strong> Replace any matching item from the list <code>eqns</code> with <code>X</code> for the given string <code>ip</code>. Match the items from <code>eqns</code> literally.<pre><code class=language-python>>>> ip = '3-(a^b)+2*(a^b)-(a/b)+3'
>>> eqns = ['(a^b)', '(a/b)', '(a^b)+2']
>>> eqns_sorted = sorted(eqns, key=len, reverse=True)
>>> pat = re.compile('|'.join(re.escape(s) for s in eqns_sorted))
>>> pat.sub('X', ip)
'3-X*X-X+3'
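# Aside (illustrative sketch, not part of the original solution): alternatives
# are tried left to right, so without the length-based sort the shorter
# '(a^b)' matches first and '(a^b)+2' never gets a chance.
# The names 'unsorted_pat' and 'unsorted_result' are made up for this illustration.
import re
ip = '3-(a^b)+2*(a^b)-(a/b)+3'
eqns = ['(a^b)', '(a/b)', '(a^b)+2']
unsorted_pat = re.compile('|'.join(re.escape(s) for s in eqns))
unsorted_result = unsorted_pat.sub('X', ip)
print(unsorted_result)  # 3-X+2*X-X+3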
</code></pre><br><h1 id=dot-metacharacter-and-quantifiers><a class=header href=#dot-metacharacter-and-quantifiers>Dot metacharacter and Quantifiers</a></h1><blockquote><p><img alt=info src=images/info.svg> Since the <code>.</code> metacharacter doesn't match the newline character by default, assume that the input strings in the following exercises will not contain newline characters.</blockquote><p><strong>1)</strong> Replace <code>42//5</code> or <code>42/5</code> with <code>8</code> for the given input.<pre><code class=language-python>>>> ip = 'a+42//5-c pressure*3+42/5-14256'
>>> re.sub(r'42//?5', '8', ip)
'a+8-c pressure*3+8-14256'
</code></pre><p><strong>2)</strong> For the list <code>items</code>, filter all elements that start with <code>hand</code> followed by nothing else, by exactly one more character, or by <code>le</code>.<pre><code class=language-python>>>> items = ['handed', 'hand', 'handled', 'handy', 'unhand', 'hands', 'handle']
>>> [w for w in items if re.fullmatch(r'hand(.|le)?', w)]
['hand', 'handy', 'hands', 'handle']
</code></pre><p><strong>3)</strong> Use <code>re.split()</code> to get the output as shown for the given input strings.<pre><code class=language-python>>>> eqn1 = 'a+42//5-c'
>>> eqn2 = 'pressure*3+42/5-14256'
>>> eqn3 = 'r*42-5/3+42///5-42/53+a'
>>> pat = re.compile(r'42//?5')
>>> pat.split(eqn1)
['a+', '-c']
>>> pat.split(eqn2)
['pressure*3+', '-14256']
>>> pat.split(eqn3)
['r*42-5/3+42///5-', '3+a']
</code></pre><p><strong>4)</strong> For the given input strings, remove everything from the first occurrence of <code>i</code> till the end of the string.<pre><code class=language-python>>>> s1 = 'remove the special meaning of such constructs'
>>> s2 = 'characters while constructing'
>>> s3 = 'input output'
>>> pat = re.compile(r'i.*')
>>> pat.sub('', s1)
'remove the spec'
>>> pat.sub('', s2)
'characters wh'
>>> pat.sub('', s3)
''
</code></pre><p><strong>5)</strong> For the given strings, construct a RE to get the output as shown below.<pre><code class=language-python>>>> str1 = 'a+b(addition)'
>>> str2 = 'a/b(division) + c%d(#modulo)'
>>> str3 = 'Hi there(greeting). Nice day(a(b)'
>>> remove_parentheses = re.compile(r'\(.*?\)')
>>> remove_parentheses.sub('', str1)
'a+b'
>>> remove_parentheses.sub('', str2)
'a/b + c%d'
>>> remove_parentheses.sub('', str3)
'Hi there. Nice day'
</code></pre><p><strong>6)</strong> Correct the given RE to get the expected output.<pre><code class=language-python>>>> words = 'plink incoming tint winter in caution sentient'
>>> change = re.compile(r'int|in|ion|ing|inco|inter|ink')
# wrong output
>>> change.sub('X', words)
'plXk XcomXg tX wXer X cautX sentient'
# expected output
>>> change = re.compile(r'in(ter|co|t|g|k)?|ion')
>>> change.sub('X', words)
'plX XmX tX wX X cautX sentient'
</code></pre><p><strong>7)</strong> For the given greedy quantifiers, what would be the equivalent form using the <code>{m,n}</code> representation?<ul><li><code>?</code> is the same as <code>{,1}</code><li><code>*</code> is the same as <code>{0,}</code><li><code>+</code> is the same as <code>{1,}</code></ul><p><strong>8)</strong> <code>(a*|b*)</code> is the same as <code>(a|b)*</code>: True or False?<p>False. <code>(a*|b*)</code> will match only sequences like <code>a</code>, <code>aaa</code>, <code>bb</code>, <code>bbbbbbbb</code>, whereas <code>(a|b)*</code> can also match mixed sequences like <code>ababbba</code>.<p><strong>9)</strong> For the given input strings, remove everything from the first occurrence of <code>test</code> (irrespective of case) till the end of the string, provided <code>test</code> isn't at the end of the string.<pre><code class=language-python>>>> s1 = 'this is a Test'
>>> s2 = 'always test your RE for corner cases'
>>> s3 = 'a TEST of skill tests?'
>>> pat = re.compile(r'test.+', flags=re.I)
>>> pat.sub('', s1)
'this is a Test'
>>> pat.sub('', s2)
'always '
>>> pat.sub('', s3)
'a '
</code></pre><p><strong>10)</strong> For the input list <code>words</code>, filter all elements starting with <code>s</code> and containing <code>e</code> and <code>t</code> in any order.<pre><code class=language-python>>>> words = ['sequoia', 'subtle', 'exhibit', 'a set', 'sets', 'tests', 'site']
>>> [w for w in words if re.search(r'\As.*(e.*t|t.*e)', w)]
['subtle', 'sets', 'site']
</code></pre><p><strong>11)</strong> For the input list <code>words</code>, remove all elements having less than <code>6</code> characters.<pre><code class=language-python>>>> words = ['sequoia', 'subtle', 'exhibit', 'asset', 'sets', 'tests', 'site']
>>> [w for w in words if re.search(r'.{6,}', w)]
['sequoia', 'subtle', 'exhibit']
</code></pre><p><strong>12)</strong> For the input list <code>words</code>, filter all elements starting with <code>s</code> or <code>t</code> and having a maximum of <code>6</code> characters.<pre><code class=language-python>>>> words = ['sequoia', 'subtle', 'exhibit', 'asset', 'sets', 't set', 'site']
>>> [w for w in words if re.fullmatch(r'(s|t).{,5}', w)]
['subtle', 'sets', 't set', 'site']
</code></pre><p><strong>13)</strong> Can you reason out why this code results in the output shown? The aim was to remove all <code><characters></code> patterns but not the <code><></code> ones. The expected result was <code>'a 1<> b 2<> c'</code>.<p>Since the <code>+</code> quantifier needs at least one character, <code><></code> can never satisfy <code><.+?></code>. So, after matching a <code><</code> that is immediately followed by <code>></code> (as happens after <code>1</code> and <code>2</code> in the given input string), the regular expression engine consumes that <code>></code> as part of <code>.+?</code> and keeps looking for the next occurrence of the <code>></code> character to satisfy the pattern. To solve such cases, you need to use character classes (discussed in a later chapter) to specify which particular set of characters should be matched by the <code>+</code> quantifier (instead of the <code>.</code> metacharacter).<pre><code class=language-python>>>> ip = 'a<apple> 1<> b<bye> 2<> c<cat>'
>>> re.sub(r'<.+?>', '', ip)
'a 1 2'
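# One possible fix (a sketch using a negated character class, which is covered
# in a later chapter, so it goes beyond what this exercise expects):
# [^>]+ needs at least one character other than '>', so '<>' is left alone.
# The name 'fixed' is made up for this illustration.
import re
ip = 'a<apple> 1<> b<bye> 2<> c<cat>'
fixed = re.sub(r'<[^>]+>', '', ip)
print(fixed)  # a 1<> b 2<> c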
</code></pre><p><strong>14)</strong> Use <code>re.split()</code> to get the output as shown below for given input strings.<pre><code class=language-python>>>> s1 = 'go there // "this // that"'
>>> s2 = 'a//b // c//d e//f // 4//5'
>>> s3 = '42// hi//bye//see // carefully'
>>> pat = re.compile(r' +// +')
>>> pat.split(s1, maxsplit=1)
['go there', '"this // that"']
>>> pat.split(s2, maxsplit=1)
['a//b', 'c//d e//f // 4//5']
>>> pat.split(s3, maxsplit=1)
['42// hi//bye//see', 'carefully']
</code></pre><p><strong>15)</strong> Modify the given regular expression such that it gives the expected results.<pre><code class=language-python>>>> s1 = 'appleabcabcabcapricot'
>>> s2 = 'bananabcabcabcdelicious'
# wrong output
>>> pat = re.compile(r'(abc)+a')
>>> bool(pat.search(s1))
True
>>> bool(pat.search(s2))
True
# expected output
# 'abc' shouldn't be considered when trying to match 'a' at the end
>>> pat = re.compile(r'(abc)++a')
>>> bool(pat.search(s1))
True
>>> bool(pat.search(s2))
False
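# Version note (a sketch, not part of the original solution): possessive
# quantifiers like ++ and atomic groups like (?>...) were added to the re
# module in Python 3.11, so the check below is skipped on older versions.
# The names 'possessive_hit' and 'atomic_hit' are made up for this illustration.
import re
import sys
s2 = 'bananabcabcabcdelicious'
if sys.version_info >= (3, 11):
    possessive_hit = bool(re.search(r'(abc)++a', s2))
    atomic_hit = bool(re.search(r'(?>(abc)+)a', s2))
    print(possessive_hit, atomic_hit)  # False False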
</code></pre><p><strong>16)</strong> Modify the given regular expression such that it gives the expected result.<pre><code class=language-python>>>> cast = 'dragon-unicorn--centaur---mage----healer'
>>> c = '-'
# wrong output
>>> re.sub(rf'{c}{3,}', c, cast)
'dragon-unicorn--centaur---mage----healer'
# expected output
>>> re.sub(rf'{c}{{3,}}', c, cast)
'dragon-unicorn--centaur-mage-healer'
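# Why the first attempt fails (illustrative sketch): inside an f-string, {3,}
# is evaluated as the Python tuple (3,), so no quantifier reaches the regex.
# Doubling the braces keeps them literal.
# The names 'wrong_pattern' and 'right_pattern' are made up for this illustration.
c = '-'
wrong_pattern = rf'{c}{3,}'
right_pattern = rf'{c}{{3,}}'
print(wrong_pattern)  # -(3,)
print(right_pattern)  # -{3,}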
</code></pre><br><h1 id=working-with-matched-portions><a class=header href=#working-with-matched-portions>Working with matched portions</a></h1><p><strong>1)</strong> For the given strings, extract the matching portion from the first <code>is</code> to the last <code>t</code>.<pre><code class=language-python>>>> str1 = 'This the biggest fruit you have seen?'
>>> str2 = 'Your mission is to read and practice consistently'
>>> pat = re.compile(r'is.*t')
>>> pat.search(str1)[0]
'is the biggest fruit'
>>> pat.search(str2)[0]
'ission is to read and practice consistent'
</code></pre><p><strong>2)</strong> Find the starting index of the first occurrence of <code>is</code> or <code>the</code> or <code>was</code> or <code>to</code> for the given input strings.<pre><code class=language-python>>>> s1 = 'match after the last newline character'
>>> s2 = 'and then you want to test'
>>> s3 = 'this is good bye then'
>>> s4 = 'who was there to see?'
>>> pat = re.compile(r'is|the|was|to')
>>> pat.search(s1).start()
12
>>> pat.search(s2).start()
4
>>> pat.search(s3).start()
2
>>> pat.search(s4).start()
4
</code></pre><p><strong>3)</strong> Find the starting index of the last occurrence of <code>is</code> or <code>the</code> or <code>was</code> or <code>to</code> for the given input strings.<pre><code class=language-python>>>> s1 = 'match after the last newline character'
>>> s2 = 'and then you want to test'
>>> s3 = 'this is good bye then'
>>> s4 = 'who was there to see?'
>>> pat = re.compile(r'.*(is|the|was|to)')
>>> pat.search(s1).start(1)
12
>>> pat.search(s2).start(1)
18
>>> pat.search(s3).start(1)
17
>>> pat.search(s4).start(1)
14
</code></pre><p><strong>4)</strong> The given input string contains <code>:</code> exactly once. Extract all characters after the <code>:</code> as output.<pre><code class=language-python>>>> ip = 'fruits:apple, mango, guava, blueberry'
>>> re.search(r':(.*)', ip)[1]
'apple, mango, guava, blueberry'
</code></pre><p><strong>5)</strong> The given input strings contain some text followed by <code>-</code> followed by a number. Replace that number with its <code>log</code> value using <code>math.log()</code>.<pre><code class=language-python>>>> s1 = 'first-3.14'
>>> s2 = 'next-123'
>>> pat = re.compile(r'-(.*)')
>>> import math
>>> pat.sub(lambda m: '-' + str(math.log(float(m[1]))), s1)
'first-1.144222799920162'
>>> pat.sub(lambda m: '-' + str(math.log(float(m[1]))), s2)
'next-4.812184355372417'
</code></pre><p><strong>6)</strong> Replace all occurrences of <code>par</code> with <code>spar</code>, <code>spare</code> with <code>extra</code> and <code>park</code> with <code>garden</code> for the given input strings.<pre><code class=language-python>>>> str1 = 'apartment has a park'
>>> str2 = 'do you have a spare cable'
>>> str3 = 'write a parser'
>>> pat = re.compile(r'park?|spare')
>>> d = {'par': 'spar', 'spare': 'extra', 'park': 'garden'}
>>> pat.sub(lambda m: d[m[0]], str1)
'aspartment has a garden'
>>> pat.sub(lambda m: d[m[0]], str2)
'do you have a extra cable'
>>> pat.sub(lambda m: d[m[0]], str3)
'write a sparser'
</code></pre><p><strong>7)</strong> Extract all words between <code>(</code> and <code>)</code> from the given input string as a list. Assume that the input will not contain any broken parentheses.<pre><code class=language-python>>>> ip = 'another (way) to reuse (portion) matched (by) capture groups'
>>> re.findall(r'\((.*?)\)', ip)
['way', 'portion', 'by']
</code></pre><p><strong>8)</strong> Extract all occurrences of <code><</code> up to the next occurrence of <code>></code>, provided there is at least one character in between <code><</code> and <code>></code>.<pre><code class=language-python>>>> ip = 'a<apple> 1<> b<bye> 2<> c<cat>'
>>> re.findall(r'<.+?>', ip)
['<apple>', '<> b<bye>', '<> c<cat>']
</code></pre><p><strong>9)</strong> Use <code>re.findall()</code> to get the output as shown below for the given input strings. Note the characters used in the input strings carefully.<pre><code class=language-python>>>> row1 = '-2,5 4,+3 +42,-53 4356246,-357532354 '
>>> row2 = '1.32,-3.14 634,5.63 63.3e3,9907809345343.235 '
>>> pat = re.compile(r'(.+?),(.+?) ')
>>> pat.findall(row1)
[('-2', '5'), ('4', '+3'), ('+42', '-53'), ('4356246', '-357532354')]
>>> pat.findall(row2)
[('1.32', '-3.14'), ('634', '5.63'), ('63.3e3', '9907809345343.235')]
</code></pre><p><strong>10)</strong> This is an extension to the previous question.<ul><li>For <code>row1</code>, find the sum of integers of each tuple element. For example, sum of <code>-2</code> and <code>5</code> is <code>3</code>.<li>For <code>row2</code>, find the sum of floating-point numbers of each tuple element. For example, sum of <code>1.32</code> and <code>-3.14</code> is <code>-1.82</code>.</ul><pre><code class=language-python>>>> row1 = '-2,5 4,+3 +42,-53 4356246,-357532354 '
>>> row2 = '1.32,-3.14 634,5.63 63.3e3,9907809345343.235 '
# should be the same as previous question
>>> pat = re.compile(r'(.+?),(.+?) ')
>>> [int(m[1]) + int(m[2]) for m in pat.finditer(row1)]
[3, 7, -11, -353176108]
>>> [float(m[1]) + float(m[2]) for m in pat.finditer(row2)]
[-1.82, 639.63, 9907809408643.234]
</code></pre><p><strong>11)</strong> Use <code>re.split()</code> to get the output as shown below.<pre><code class=language-python>>>> ip = '42:no-output;1000:car-tr:u-ck;SQEX49801'
>>> re.split(r':.+?-(.+?);', ip)
['42', 'output', '1000', 'tr:u-ck', 'SQEX49801']
</code></pre><p><strong>12)</strong> For the given list of strings, change the elements into a tuple of original element and the number of times <code>t</code> occurs in that element.<pre><code class=language-python>>>> words = ['sequoia', 'attest', 'tattletale', 'asset']
>>> [re.subn(r't', 't', w) for w in words]
[('sequoia', 0), ('attest', 3), ('tattletale', 4), ('asset', 1)]
</code></pre><p><strong>13)</strong> The given input string has fields separated by <code>:</code>. Each field contains four uppercase alphabets followed optionally by two digits. Ignore the last field, which is empty. See <a href=https://docs.python.org/3/library/re.html#re.Match.groups>docs.python: Match.groups</a> and use <code>re.finditer()</code> to get the output as shown below. If the optional digits aren't present, show <code>'NA'</code> instead of <code>None</code>.<pre><code class=language-python>>>> ip = 'TWXA42:JWPA:NTED01:'
>>> [m.groups(default='NA') for m in re.finditer(r'(.{4})(..)?:', ip)]
[('TWXA', '42'), ('JWPA', 'NA'), ('NTED', '01')]
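</code></pre><p>For comparison, here is a quick sketch of what <code>re.findall()</code> returns for the same input and pattern. A group that doesn't participate in the match shows up as an empty string, so the <code>'NA'</code> default isn't available this way:<pre><code class=language-python>>>> import re
>>> ip = 'TWXA42:JWPA:NTED01:'
>>> re.findall(r'(.{4})(..)?:', ip)
[('TWXA', '42'), ('JWPA', ''), ('NTED', '01')]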
</code></pre><blockquote><p><img alt=info src=images/info.svg> Note that this is different from <code>re.findall()</code>, which gives an empty string instead of <code>None</code> when a capture group doesn't participate in the match.</blockquote><p><strong>14)</strong> Convert the comma separated strings to corresponding <code>dict</code> objects as shown below.<pre><code class=language-python>>>> row1 = 'name:rohan,maths:75,phy:89,'
>>> row2 = 'name:rose,maths:88,phy:92,'
>>> pat = re.compile(r'(.+?):(.+?),')
# can also use dict(pat.findall(row1))
>>> {m[1]:m[2] for m in pat.finditer(row1)}
{'name': 'rohan', 'maths': '75', 'phy': '89'}
# can also use dict(pat.findall(row2))
>>> {m[1]:m[2] for m in pat.finditer(row2)}
{'name': 'rose', 'maths': '88', 'phy': '92'}
</code></pre><br><h1 id=character-class><a class=header href=#character-class>Character class</a></h1><p><strong>1)</strong> For the list <code>items</code>, filter all elements starting with <code>hand</code> and ending immediately with <code>s</code> or <code>y</code> or <code>le</code>.<pre><code class=language-python>>>> items = ['-handy', 'hand', 'handy', 'unhand', 'hands', 'hand-icy', 'handle']
>>> [w for w in items if re.fullmatch(r'hand([sy]|le)', w)]
['handy', 'hands', 'handle']
</code></pre><p><strong>2)</strong> Replace all whole words <code>reed</code> or <code>read</code> or <code>red</code> with <code>X</code>.<pre><code class=language-python>>>> ip = 'redo red credible :read: rod reed'
>>> re.sub(r'\bre[ae]?d\b', 'X', ip)
'redo X credible :X: rod X'
</code></pre><p><strong>3)</strong> For the list <code>words</code>, filter all elements containing <code>e</code> or <code>i</code> followed by <code>l</code> or <code>n</code>. Note that the order mentioned should be followed.<pre><code class=language-python>>>> words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']
>>> [w for w in words if re.search(r'[ei].*[ln]', w)]
['surrender', 'unicorn', 'eel']
</code></pre><p><strong>4)</strong> For the list <code>words</code>, filter all elements containing <code>e</code> or <code>i</code> and <code>l</code> or <code>n</code> in any order.<pre><code class=language-python>>>> words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']
>>> [w for w in words if re.search(r'[ei].*[ln]|[ln].*[ei]', w)]
['surrender', 'unicorn', 'newer', 'eel']
</code></pre><p><strong>5)</strong> Extract all hex character sequences, with <code>0x</code> optional prefix. Match the characters case insensitively, and the sequences shouldn't be surrounded by other word characters.<pre><code class=language-python>>>> str1 = '128A foo 0xfe32 34 0xbar'
>>> str2 = '0XDEADBEEF place 0x0ff1ce bad'
>>> hex_seq = re.compile(r'\b(0x)?[\da-f]+\b', flags=re.I)
>>> [m[0] for m in hex_seq.finditer(str1)]
['128A', '0xfe32', '34']
>>> [m[0] for m in hex_seq.finditer(str2)]
['0XDEADBEEF', '0x0ff1ce', 'bad']
</code></pre><p><strong>6)</strong> Delete from <code>(</code> to the next occurrence of <code>)</code> unless they contain parentheses characters in between.<pre><code class=language-python>>>> str1 = 'def factorial()'
>>> str2 = 'a/b(division) + c%d(#modulo) - (e+(j/k-3)*4)'
>>> str3 = 'Hi there(greeting). Nice day(a(b)'
>>> remove_parentheses = re.compile(r'\([^()]*\)')
>>> remove_parentheses.sub('', str1)
'def factorial'
>>> remove_parentheses.sub('', str2)
'a/b + c%d - (e+*4)'
>>> remove_parentheses.sub('', str3)
'Hi there. Nice day(a'
</code></pre><p><strong>7)</strong> For the list <code>words</code>, filter all elements not starting with <code>e</code> or <code>p</code> or <code>u</code>.<pre><code class=language-python>>>> words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', '(pest)']
>>> [w for w in words if re.search(r'\A[^epu]', w)]
['surrender', 'newer', 'door', '(pest)']
</code></pre><p><strong>8)</strong> For the list <code>words</code>, filter all elements not containing <code>u</code> or <code>w</code> or <code>ee</code> or <code>-</code>.<pre><code class=language-python>>>> words = ['p-t', 'you', 'tea', 'heel', 'owe', 'new', 'reed', 'ear']
>>> [w for w in words if not re.search(r'[uw-]|ee', w)]
['tea', 'ear']
</code></pre><p><strong>9)</strong> The given input strings contain fields separated by <code>,</code> and fields can be empty too. Replace the last three fields with <code>WHTSZ323</code>.<pre><code class=language-python>>>> row1 = '(2),kite,12,,D,C,,'
>>> row2 = 'hi,bye,sun,moon'
>>> pat = re.compile(r'(,[^,]*){3}\Z')
>>> pat.sub(',WHTSZ323', row1)
'(2),kite,12,,D,WHTSZ323'
>>> pat.sub(',WHTSZ323', row2)
'hi,WHTSZ323'
</code></pre><p><strong>10)</strong> Split the given strings based on consecutive sequence of digit or whitespace characters.<pre><code class=language-python>>>> str1 = 'lion \t Ink32onion Nice'
>>> str2 = '**1\f2\n3star\t7 77\r**'
>>> pat = re.compile(r'[\d\s]+')
>>> pat.split(str1)
['lion', 'Ink', 'onion', 'Nice']
>>> pat.split(str2)
['**', 'star', '**']
</code></pre><p><strong>11)</strong> Delete all occurrences of the sequence <code><characters></code> where <code>characters</code> is one or more non <code>></code> characters and cannot be empty.<pre><code class=language-python>>>> ip = 'a<ap\nple> 1<> b<bye> 2<> c<cat>'
>>> re.sub(r'<[^>]+>', '', ip)
'a 1<> b 2<> c'
</code></pre><p><strong>12)</strong> <code>\b[a-z](on|no)[a-z]\b</code> is the same as <code>\b[a-z][on]{2}[a-z]\b</code>. True or False? The sample input lines shown below might help to understand the differences, if any.<p>False. <code>[on]{2}</code> will also match <code>oo</code> and <code>nn</code>.<pre><code class=language-python>>>> print('known\nmood\nknow\npony\ninns')
known
mood
know
pony
inns
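</code></pre><p>A quick check of this answer, assuming the sample lines are stored in a list named <code>words</code> (a name introduced here only for illustration):<pre><code class=language-python>>>> import re
>>> words = ['known', 'mood', 'know', 'pony', 'inns']
>>> [w for w in words if re.search(r'\b[a-z](on|no)[a-z]\b', w)]
['know', 'pony']
# 'mood' and 'inns' match too, because of 'oo' and 'nn'
>>> [w for w in words if re.search(r'\b[a-z][on]{2}[a-z]\b', w)]
['mood', 'know', 'pony', 'inns']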
</code></pre><p><strong>13)</strong> For the given list, filter all elements containing any number sequence greater than <code>624</code>.<pre><code class=language-python>>>> items = ['hi0000432abcd', 'car00625', '42_624 0512', '3.14 96 2 foo1234baz']
>>> [e for e in items if any(int(m[0])>624 for m in re.finditer(r'\d+', e))]
['car00625', '3.14 96 2 foo1234baz']
</code></pre><p><strong>14)</strong> Count the maximum depth of nested braces for the given strings. Unbalanced or wrongly ordered braces should return <code>-1</code>. Note that this will require a mix of regular expressions and Python code.<pre><code class=language-python>>>> def max_nested_braces(ip):
...     count = 0
...     while (op := re.subn(r'\{[^{}]*\}', '', ip))[1]:
...         count += 1
...         ip = op[0]
...     if re.search(r'[{}]', ip):
...         return -1
...     return count
...
>>> max_nested_braces('a*b')
0
>>> max_nested_braces('}a+b{')
-1
>>> max_nested_braces('a*b+{}')
1
>>> max_nested_braces('{{a+2}*{b+c}+e}')
2
>>> max_nested_braces('{{a+2}*{b+{c*d}}+e}')
3
>>> max_nested_braces('{{a+2}*{\n{b+{c*d}}+e*d}}')
4
>>> max_nested_braces('a*{b+c*{e*3.14}}}')
-1
</code></pre><p><strong>15)</strong> By default, the <code>str.split()</code> method will split on whitespace and remove empty strings from the result. Which <code>re</code> module function would you use to replicate this functionality?<pre><code class=language-python>>>> ip = ' \t\r so pole\t\t\t\n\nlit in to \r\n\v\f '
>>> ip.split()
['so', 'pole', 'lit', 'in', 'to']
>>> re.findall(r'\S+', ip)
['so', 'pole', 'lit', 'in', 'to']
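</code></pre><p>For comparison, a sketch of what <code>re.split()</code> gives for the same input. The leading and trailing whitespace produce empty strings at both ends, which is why <code>re.findall()</code> is the closer match for <code>str.split()</code> here:<pre><code class=language-python>>>> import re
>>> ip = ' \t\r so pole\t\t\t\n\nlit in to \r\n\v\f '
>>> re.split(r'\s+', ip)
['', 'so', 'pole', 'lit', 'in', 'to', '']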
</code></pre><p><strong>16)</strong> Convert the given input string to two different lists as shown below.<pre><code class=language-python>>>> ip = 'price_42 roast^\t\n^-ice==cat\neast'
>>> re.split(r'\W+', ip)
['price_42', 'roast', 'ice', 'cat', 'east']
>>> re.split(r'(\W+)', ip)
['price_42', ' ', 'roast', '^\t\n^-', 'ice', '==', 'cat', '\n', 'east']
</code></pre><p><strong>17)</strong> Filter all whole elements with optional whitespaces at the start followed by three to five non-digit characters. Whitespaces at the start should not be part of the calculation for non-digit characters.<pre><code class=language-python>>>> items = ['\t \ncat', 'goal', ' oh', 'he-he', 'goal2', 'ok ', 'sparrow']
# if possessive quantifiers aren't supported: r'\s*[^\d\s]\D{2,4}'
>>> [e for e in items if re.fullmatch(r'\s*+\D{3,5}', e)]
['\t \ncat', 'goal', 'he-he', 'ok ']
</code></pre><br><h1 id=groupings-and-backreferences><a class=header href=#groupings-and-backreferences>Groupings and backreferences</a></h1><p><strong>1)</strong> Replace the space character that occurs after a word ending with <code>a</code> or <code>r</code> with a newline character.<pre><code class=language-python>>>> ip = 'area not a _a2_ roar took 22'
>>> print(re.sub(r'([ar]) ', r'\1\n', ip))
area
not a
_a2_ roar
took 22
</code></pre><p><strong>2)</strong> Add <code>[]</code> around words starting with <code>s</code> and containing <code>e</code> and <code>t</code> in any order.<pre><code class=language-python>>>> ip = 'sequoia subtle exhibit asset sets2 tests si_te'
>>> re.sub(r'\bs\w*(t\w*e|e\w*t)\w*', r'[\g<0>]', ip)
'sequoia [subtle] exhibit asset [sets2] tests [si_te]'
</code></pre><p><strong>3)</strong> Replace all whole words with <code>X</code> that start and end with the same word character (irrespective of case). Single character word should get replaced with <code>X</code> too, as it satisfies the stated condition.<pre><code class=language-python>>>> ip = 'oreo not a _a2_ Roar took 22'
# can also use: re.sub(r'\b(\w|(\w)\w*\2)\b', 'X', ip, flags=re.I)
>>> re.sub(r'\b(\w)(\w*\1)?\b', 'X', ip, flags=re.I)
'X not X X X took X'
</code></pre><p><strong>4)</strong> Convert the given <em>markdown</em> headers to corresponding <em>anchor</em> tags. Consider the input to start with one or more <code>#</code> characters followed by space and word characters. The <code>name</code> attribute is constructed by converting the header to lowercase and replacing spaces with hyphens. Can you do it without using a capture group?<pre><code class=language-python>>>> header1 = '# Regular Expressions'
>>> header2 = '## Compiling regular expressions'
>>> anchor = re.compile(r'\w.*')
>>> def hyphenify(m):
...     return f'<a name="{m[0].lower().replace(" ", "-")}"></a>{m[0]}'
...
>>> anchor.sub(hyphenify, header1)
'# <a name="regular-expressions"></a>Regular Expressions'
>>> anchor.sub(hyphenify, header2)
'## <a name="compiling-regular-expressions"></a>Compiling regular expressions'
</code></pre><p><strong>5)</strong> Convert the given <em>markdown</em> anchors to corresponding <em>hyperlinks</em>.<pre><code class=language-python>>>> anchor1 = '# <a name="regular-expressions"></a>Regular Expressions'
>>> anchor2 = '## <a name="subexpression-calls"></a>Subexpression calls'
>>> hyperlink = re.compile(r'[^"]+"([^"]+)"></a>(.+)')
>>> hyperlink.sub(r'[\2](#\1)', anchor1)
'[Regular Expressions](#regular-expressions)'
>>> hyperlink.sub(r'[\2](#\1)', anchor2)
'[Subexpression calls](#subexpression-calls)'
</code></pre><p><strong>6)</strong> Count the number of whole words that have at least two occurrences of consecutive repeated alphabets. For example, words like <code>stillness</code> and <code>Committee</code> should be counted but not words like <code>root</code> or <code>readable</code> or <code>rotational</code>.<pre><code class=language-python>>>> ip = '''oppressed abandon accommodation bloodless
... carelessness committed apparition innkeeper
... occasionally afforded embarrassment foolishness
... depended successfully succeeded
... possession cleanliness suppress'''
# can also use: r'\b\w*(\w)\1\w*(\w)\2\w*\b'
>>> len(re.findall(r'\b(\w*(\w)\2){2}\w*\b', ip))
13
</code></pre><p><strong>7)</strong> For the given input string, replace all occurrences of digit sequences with only the unique non-repeating sequence. For example, <code>232323</code> should be changed to <code>23</code> and <code>897897</code> should be changed to <code>897</code>. If there are no repeats (for example <code>1234</code>) or if the repeats end prematurely (for example <code>12121</code>), it should not be changed.<pre><code class=language-python>>>> ip = '1234 2323 453545354535 9339 11 60260260'
>>> re.sub(r'\b(\d+)\1+\b', r'\1', ip)
'1234 23 4535 9339 1 60260260'
</code></pre><p><strong>8)</strong> Replace sequences made up of words separated by <code>:</code> or <code>.</code> by the first word of the sequence. Such sequences will end when <code>:</code> or <code>.</code> is not followed by a word character.<pre><code class=language-python>>>> ip = 'wow:Good:2_two.five: hi-2 bye kite.777:water.'
>>> re.sub(r'([:.]\w*)+', '', ip)
'wow hi-2 bye kite'
</code></pre><p><strong>9)</strong> Replace sequences made up of words separated by <code>:</code> or <code>.</code> by the last word of the sequence. Such sequences will end when <code>:</code> or <code>.</code> is not followed by a word character.<pre><code class=language-python>>>> ip = 'wow:Good:2_two.five: hi-2 bye kite.777:water.'
>>> re.sub(r'((\w+)[:.])+', r'\2', ip)
'five hi-2 bye water'
</code></pre><p><strong>10)</strong> Split the given input string on one or more repeated sequence of <code>cat</code>.<pre><code class=language-python>>>> ip = 'firecatlioncatcatcatbearcatcatparrot'
>>> re.split(r'(?:cat)+', ip)
['fire', 'lion', 'bear', 'parrot']
</code></pre><p><strong>11)</strong> For the given input string, find all occurrences of digit sequences with at least one repeating sequence. For example, <code>232323</code> and <code>897897</code>. If the repeats end prematurely, for example <code>12121</code>, it should not be matched.<pre><code class=language-python>>>> ip = '1234 2323 453545354535 9339 11 60260260'
>>> pat = re.compile(r'\b(\d+)\1+\b')
# entire sequences in the output
>>> [m[0] for m in pat.finditer(ip)]
['2323', '453545354535', '11']
# only the unique sequence in the output
>>> pat.findall(ip)
['23', '4535', '1']
</code></pre><p><strong>12)</strong> Convert the comma separated strings to corresponding <code>dict</code> objects as shown below. The keys are <code>name</code>, <code>maths</code> and <code>phy</code> for the three fields in the input strings.<pre><code class=language-python>>>> row1 = 'rohan,75,89'
>>> row2 = 'rose,88,92'
>>> pat = re.compile(r'(?P<name>[^,]+),(?P<maths>[^,]+),(?P<phy>[^,]+)')
>>> pat.search(row1).groupdict()
{'name': 'rohan', 'maths': '75', 'phy': '89'}
>>> pat.search(row2).groupdict()
{'name': 'rose', 'maths': '88', 'phy': '92'}
</code></pre><p><strong>13)</strong> Surround all whole words with <code>()</code>. Additionally, if the whole word is <code>imp</code> or <code>ant</code>, delete them. Can you do it with just a single substitution?<pre><code class=language-python>>>> ip = 'tiger imp goat eagle ant important'
>>> re.sub(r'\b(?:imp|ant|(\w+))\b', r'(\1)', ip)
'(tiger) () (goat) (eagle) () (important)'
</code></pre><p><strong>14)</strong> Filter all elements that contain a sequence of lowercase alphabets followed by <code>-</code> followed by digits. They can be optionally surrounded by <code>{{</code> and <code>}}</code>. Any partial match shouldn't be part of the output.<pre><code class=language-python>>>> ip = ['{{apple-150}}', '{{mango2-100}}', '{{cherry-200', 'grape-87']
>>> [w for w in ip if re.fullmatch(r'({{)?[a-z]+-\d+(?(1)}})', w)]
['{{apple-150}}', 'grape-87']
</code></pre><p><strong>15)</strong> The given input string has sequences made up of words separated by <code>:</code> or <code>.</code> and such sequences will end when <code>:</code> or <code>.</code> is not followed by a word character. For all such sequences, display only the last word followed by <code>-</code> followed by the first word.<pre><code class=language-python>>>> ip = 'wow:Good:2_two.five: hi-2 bye kite.777:water.'
# can also use f'{m[2]}-{m[1]}' instead of m.expand(r'\2-\1')
>>> [m.expand(r'\2-\1') for m in re.finditer(r'(\w+)[:.](?:(\w+)[:.])+', ip)]
['five-wow', 'water-kite']
</code></pre><p><strong>16)</strong> Modify the given regular expression such that it gives the expected result.<pre><code class=language-python>>>> ip = '( S:12 E:5 S:4 and E:123 ok S:100 & E:10 S:1 - E:2 S:42 E:43 )'
# wrong output
>>> re.findall(r'S:\d+.*?E:\d{2,}', ip)
['S:12 E:5 S:4 and E:123', 'S:100 & E:10', 'S:1 - E:2 S:42 E:43']
# expected output
>>> re.findall(r'(?>S:\d+.*?E:)\d{2,}', ip)
['S:4 and E:123', 'S:100 & E:10', 'S:42 E:43']
</code></pre><br><h1 id=lookarounds><a class=header href=#lookarounds>Lookarounds</a></h1><blockquote><p><img alt=info src=images/info.svg> Please use lookarounds to solve the following exercises even if you can solve them without lookarounds. The exception is when lookarounds cannot be used, for example when the solution would need a variable-length lookbehind.</blockquote><p><strong>1)</strong> Replace all whole words with <code>X</code> unless they are preceded by a <code>(</code> character.<pre><code class=language-python>>>> ip = '(apple) guava berry) apple (mango) (grape'
>>> re.sub(r'(?<!\()\b\w+', 'X', ip)
'(apple) X X) X (mango) (grape'
</code></pre><p><strong>2)</strong> Replace all whole words with <code>X</code> unless they are followed by a <code>)</code> character.<pre><code class=language-python>>>> ip = '(apple) guava berry) apple (mango) (grape'
>>> re.sub(r'\w+\b(?!\))', 'X', ip)
'(apple) X berry) X (mango) (X'
</code></pre><p><strong>3)</strong> Replace all whole words with <code>X</code> unless they are preceded by <code>(</code> or followed by <code>)</code> characters.<pre><code class=language-python>>>> ip = '(apple) guava berry) apple (mango) (grape'
>>> re.sub(r'(?<!\()\b\w+\b(?!\))', 'X', ip)
'(apple) X berry) X (mango) (grape'
</code></pre><p><strong>4)</strong> Extract all whole words that do not end with <code>e</code> or <code>n</code>.<pre><code class=language-python>>>> ip = 'a_t row on Urn e note Dust n end a2-e|u'
>>> re.findall(r'\b\w+\b(?<![en])', ip)
['a_t', 'row', 'Dust', 'end', 'a2', 'u']
</code></pre><p><strong>5)</strong> Extract all whole words that do not start with <code>a</code> or <code>d</code> or <code>n</code>.<pre><code class=language-python>>>> ip = 'a_t row on Urn e note Dust n end a2-e|u'
>>> re.findall(r'(?![adn])\b\w+', ip)
['row', 'on', 'Urn', 'e', 'Dust', 'end', 'e', 'u']
</code></pre><p><strong>6)</strong> Extract all whole words only if they are followed by <code>:</code> or <code>,</code> or <code>-</code>.<pre><code class=language-python>>>> ip = 'Poke,on=-=so_good:ink.to/is(vast)ever2-sit'
>>> re.findall(r'\w+(?=[:,-])', ip)
['Poke', 'so_good', 'ever2']
</code></pre><p><strong>7)</strong> Extract all whole words only if they are preceded by <code>=</code> or <code>/</code> or <code>-</code>.<pre><code class=language-python>>>> ip = 'Poke,on=-=so_good:ink.to/is(vast)ever2-sit'
>>> re.findall(r'(?<=[=/-])\w+', ip)
['so_good', 'is', 'sit']
</code></pre><p><strong>8)</strong> Extract all whole words only if they are preceded by <code>=</code> or <code>:</code> and followed by <code>:</code> or <code>.</code>.<pre><code class=language-python>>>> ip = 'Poke,on=-=so_good:ink.to/is(vast)ever2-sit'
>>> re.findall(r'(?<=[=:])\w+(?=[:.])', ip)
['so_good', 'ink']
</code></pre><p><strong>9)</strong> Extract all whole words only if they are preceded by <code>=</code> or <code>:</code> or <code>.</code> or <code>(</code> or <code>-</code> and not followed by <code>.</code> or <code>/</code>.<pre><code class=language-python>>>> ip = 'Poke,on=-=so_good:ink.to/is(vast)ever2-sit'
>>> re.findall(r'(?<=[=:.(-])\w+\b(?![/.])', ip)
['so_good', 'vast', 'sit']
</code></pre><p><strong>10)</strong> Remove the leading and trailing whitespaces from all the individual fields where <code>,</code> is the field separator.<pre><code class=language-python>>>> csv1 = ' comma ,separated ,values \t\r '
>>> csv2 = 'good bad,nice ice , 42 , , stall small'
>>> remove_whitespace = re.compile(r'(?<![^,])\s+|\s+(?![^,])')
>>> remove_whitespace.sub('', csv1)
'comma,separated,values'
>>> remove_whitespace.sub('', csv2)
'good bad,nice ice,42,,stall small'
</code></pre><p><strong>11)</strong> Filter all elements that satisfy all of these rules:<ul><li>should have at least two alphabets<li>should have at least three digits<li>should have at least one special character among <code>%</code> or <code>*</code> or <code>#</code> or <code>$</code><li>should not end with a whitespace character</ul><pre><code class=language-python>>>> pwds = ['hunter2', 'F2H3u%9', '*X3Yz3.14\t', 'r2_d2_42', 'A $B C1234']
>>> rule_chk = re.compile(r'(?=(.*[a-zA-Z]){2})(?=(.*\d){3})(?!.+\s\Z).*[%*#$]')
>>> [p for p in pwds if rule_chk.search(p)]
['F2H3u%9', 'A $B C1234']
</code></pre><p><strong>12)</strong> For the given string, surround all whole words with <code>{}</code> except for whole words <code>par</code> and <code>cat</code> and <code>apple</code>.<pre><code class=language-python>>>> ip = 'part; cat {super} rest_42 par scatter apple spar'
>>> re.sub(r'\b(?!(?:par|cat|apple)\b)\w+', r'{\g<0>}', ip)
'{part}; cat {{super}} {rest_42} par {scatter} apple {spar}'
</code></pre><p><strong>13)</strong> Extract integer portion of floating-point numbers for the given string. Integers and numbers ending with <code>.</code> and no further digits should not be considered.<pre><code class=language-python>>>> ip = '12 ab32.4 go 5 2. 46.42 5'
>>> re.findall(r'\d+(?=\.\d)', ip)
['32', '46']
</code></pre><p><strong>14)</strong> For the given input strings, extract all overlapping two character sequences.<pre><code class=language-python>>>> s1 = 'apple'
>>> s2 = '1.2-3:4'
>>> pat = re.compile(r'.(?=(.))')
>>> [m[0]+m[1] for m in pat.finditer(s1)]
['ap', 'pp', 'pl', 'le']
>>> [m[0]+m[1] for m in pat.finditer(s2)]
['1.', '.2', '2-', '-3', '3:', ':4']
</code></pre><p><strong>15)</strong> The given input strings contain fields separated by the <code>:</code> character. Delete <code>:</code> and the last field if there is a digit character anywhere before the last field.<pre><code class=language-python>>>> s1 = '42:cat'
>>> s2 = 'twelve:a2b'
>>> s3 = 'we:be:he:0:a:b:bother'
>>> s4 = 'apple:banana-42:cherry:'
>>> s5 = 'dragon:unicorn:centaur'
>>> pat = re.compile(r'(\d.*):.*')
>>> pat.sub(r'\1', s1)
'42'
>>> pat.sub(r'\1', s2)
'twelve:a2b'
>>> pat.sub(r'\1', s3)
'we:be:he:0:a:b'
>>> pat.sub(r'\1', s4)
'apple:banana-42:cherry'
>>> pat.sub(r'\1', s5)
'dragon:unicorn:centaur'
</code></pre><p><strong>16)</strong> Extract all whole words unless they are preceded by <code>:</code> or <code><=></code> or <code>----</code> or <code>#</code>.<pre><code class=language-python>>>> ip = '::very--at<=>row|in.a_b#b2c=>lion----east'
>>> re.findall(r'(?<![:#])(?<!<=>)(?<!-{4})\b\w+', ip)
['at', 'in', 'a_b', 'lion']
</code></pre><p><strong>17)</strong> Match the strings that contain <code>qty</code> followed by <code>price</code>, but not if there is any <strong>whitespace</strong> character or the string <code>error</code> between them.<pre><code class=language-python>>>> str1 = '23,qty,price,42'
>>> str2 = 'qty price,oh'
>>> str3 = '3.14,qty,6,errors,9,price,3'
>>> str4 = '42\nqty-6,apple-56,price-234,error'
>>> str5 = '4,price,3.14,qty,4'
>>> str6 = '(qtyprice) (hi-there)'
>>> neg = re.compile(r'qty((?!\s|error).)*price')
>>> bool(neg.search(str1))
True
>>> bool(neg.search(str2))
False
>>> bool(neg.search(str3))
False
>>> bool(neg.search(str4))
True
>>> bool(neg.search(str5))
False
>>> bool(neg.search(str6))
True
</code></pre><p><strong>18)</strong> Can you reason out why the following regular expressions behave differently?<p><code>\b</code> matches both the start and end of word locations. In the below example, <code>\b..\b</code> doesn't necessarily mean that the first <code>\b</code> will match only the start of word location and the second <code>\b</code> will match only the end of word location. They can be any combination! For example, <code>I</code> followed by space in the input string here is using the start of word location for both the conditions. Similarly, space followed by <code>2</code> is using the end of word location for both the conditions.<p>In contrast, the negative lookarounds version ensures that there are no word characters around any two characters. Also, such assertions will always be satisfied at the start of string and the end of string respectively. But <code>\b</code> depends on the presence of word characters. For example, <code>!</code> at the end of the input string here matches the lookaround assertion but not word boundary.<pre><code class=language-python>>>> ip = 'I have 12, he has 2!'
>>> re.sub(r'\b..\b', r'{\g<0>}', ip)
'{I }have {12}{, }{he} has{ 2}!'
>>> re.sub(r'(?<!\w)..(?!\w)', r'{\g<0>}', ip)
'I have {12}, {he} has {2!}'
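</code></pre><p>To make the first point concrete, here is a small sketch that marks every location where <code>\b</code> matches in the same input. Both the start and the end of each word are marked, and nothing is marked after the final <code>!</code> since no word character is adjacent to the end of the string:<pre><code class=language-python>>>> import re
>>> ip = 'I have 12, he has 2!'
>>> re.sub(r'\b', '|', ip)
'|I| |have| |12|, |he| |has| |2|!'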
</code></pre><p><strong>19)</strong> The given input string has comma separated fields and some of them can occur more than once. For the duplicated fields, retain only the rightmost one. Assume that there are no empty fields.<pre><code class=language-python>>>> row = '421,cat,2425,42,5,cat,6,6,42,61,6,6,scat,6,6,4,Cat,425,4'
>>> re.sub(r'(?<![^,])([^,]+),(?=.*(?<![^,])\1(?![^,]))', '', row)
'421,2425,5,cat,42,61,scat,6,Cat,425,4'
</code></pre><br><h1 id=flags><a class=header href=#flags>Flags</a></h1><p><strong>1)</strong> Remove from the first occurrence of <code>hat</code> to the last occurrence of <code>it</code> for the given input strings. Match these markers case insensitively.<pre><code class=language-python>>>> s1 = 'But Cool THAT\nsee What okay\nwow quite'
>>> s2 = 'it this hat is sliced HIT.'
>>> pat = re.compile(r'hat.*it', flags=re.S|re.I)
>>> pat.sub('', s1)
'But Cool Te'
>>> pat.sub('', s2)
'it this .'
</code></pre><p><strong>2)</strong> Delete from <code>start</code> if it is at the beginning of a line up to the next occurrence of the <code>end</code> at the end of a line. Match these markers case insensitively.<pre><code class=language-python>>>> para = '''\
... good start
... start working on that
... project you always wanted
... to, do not let it end
... hi there
... start and end the end
... 42
... Start and try to
... finish the End
... bye'''
>>> pat = re.compile(r'(?ims)^start.*?end$')
>>> print(pat.sub('', para))
good start
hi there
42
bye
</code></pre><p><strong>3)</strong> For the given input strings, match all of these three conditions:<ul><li><code>This</code> case sensitively<li><code>nice</code> and <code>cool</code> case insensitively</ul><pre><code class=language-python>>>> s1 = 'This is nice and Cool'
>>> s2 = 'Nice and cool this is'
>>> s3 = 'What is so nice and cool about This?'
>>> s4 = 'nice,cool,This'
>>> s5 = 'not nice This?'
>>> s6 = 'This is not cool'
>>> pat = re.compile(r'(?i)(?=.*nice)(?=.*cool)(?-i:.*This)')
>>> bool(pat.search(s1))
True
>>> bool(pat.search(s2))
False
>>> bool(pat.search(s3))
True
>>> bool(pat.search(s4))
True
>>> bool(pat.search(s5))
False
>>> bool(pat.search(s6))
False
</code></pre><p><strong>4)</strong> For the given input strings, match if the string begins with <code>Th</code> and also contains a line that starts with <code>There</code>.<pre><code class=language-python>>>> s1 = 'There there\nHave a cookie'
>>> s2 = 'This is a mess\nYeah?\nThereeeee'
>>> s3 = 'Oh\nThere goes the fun'
>>> s4 = 'This is not\ngood\nno There'
>>> pat = re.compile(r'\A(?=Th)(?ms:.*^There)')
>>> bool(pat.search(s1))
True
>>> bool(pat.search(s2))
True
>>> bool(pat.search(s3))
False
>>> bool(pat.search(s4))
False
</code></pre><p><strong>5)</strong> Explore what the <code>re.DEBUG</code> flag does. Here are some example patterns to check out.<ul><li><code>re.compile(r'\Aden|ly\Z', flags=re.DEBUG)</code><li><code>re.compile(r'\b(0x)?[\da-f]+\b', flags=re.DEBUG)</code><li><code>re.compile(r'\b(?:0x)?[\da-f]+\b', flags=re.I|re.DEBUG)</code></ul><br><h1 id=unicode><a class=header href=#unicode>Unicode</a></h1><p><strong>1)</strong> Output <code>True</code> or <code>False</code> depending on whether the input string is made up of ASCII characters only. Assume that the input strings are non-empty. Any character that isn't part of the 7-bit ASCII set should give <code>False</code>. Do you need regular expressions for this?<pre><code class=language-python>>>> str1 = '123—456'
>>> str2 = 'good fοοd'
>>> str3 = 'happy learning!'
>>> str4 = 'İıſK'
>>> str5 = 'àpple'
>>> str1.isascii()
False
>>> str2.isascii()
False
>>> str3.isascii()
True
>>> str4.isascii()
False
>>> str5.isascii()
False
# check the codepoints if you are wondering why some results are False
>>> [c.encode('unicode_escape') for c in str2]
[b'g', b'o', b'o', b'd', b' ', b'f', b'\\u03bf', b'\\u03bf', b'd']
# you can use character range for regular expression based solution
>>> not bool(re.search(r'[^\x00-\x7f]', str1))
False
</code></pre><p><strong>2)</strong> Does the <code>.</code> metacharacter match non-ASCII characters even with the <code>re.ASCII</code> flag enabled?<p>Yes.<pre><code class=language-python>>>> re.search(r'.+', 'fox:αλεπού')[0]
'fox:αλεπού'
>>> re.search(r'(?a).+', 'fox:αλεπού')[0]
'fox:αλεπού'
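# A short sketch (not from the original solution): re.ASCII restricts escapes
# such as \w, \d, \s and \b to ASCII-only matching, so \w is shown here for
# contrast with '.'. The variable 'text' is illustrative only.
import re

text = 'fox:\u03b1\u03bb\u03c6\u03b1'   # 'fox:' followed by four Greek letters
unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'(?a)\w+', text)
print(unicode_words)  # ['fox', 'αλφα']
print(ascii_words)    # ['fox']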
</code></pre><p><strong>3)</strong> Explore the following Stack Overflow Q&A threads.<ul><li><a href=https://stackoverflow.com/q/57553721/4082052>Remove powered number from string</a><li><a href=https://stackoverflow.com/q/1922097/4082052>Regular expression for French characters</a></ul><br><h1 id=regex-module><a class=header href=#regex-module>regex module</a></h1><p><strong>1)</strong> List the two <code>regex</code> module constants that affect compatibility with the <code>re</code> module. Also specify their corresponding inline flags.<ul><li><code>regex.VERSION0</code> is compatible with the <code>re</code> module (default). Inline flag is <code>(?V0)</code><li><code>regex.VERSION1</code> is needed to use all of the features provided by the <code>regex</code> module. Inline flag is <code>(?V1)</code></ul><p>Set <code>regex.DEFAULT_VERSION</code> to <code>regex.VERSION0</code> or <code>regex.VERSION1</code> to configure the default version globally.<blockquote><p><img alt=info src=images/info.svg> Solutions presented below will assume <code>regex.VERSION1</code> is already set.</blockquote><p><strong>2)</strong> Replace sequences made up of words separated by <code>:</code> or <code>.</code> with the first word of the sequence and the separator. Such sequences will end when <code>:</code> or <code>.</code> is not followed by a word character.<pre><code class=language-python>>>> ip = 'wow:Good:2_two.five: hi-2 bye kite.777:water.'
>>> regex.sub(r'(\w+[:.])(?1)+', r'\1', ip)
'wow: hi-2 bye kite.'
</code></pre><p><strong>3)</strong> The given list of strings has fields separated by the <code>:</code> character. Delete <code>:</code> and the last field if there is a digit character anywhere before the last field.<pre><code class=language-python>>>> items = ['42:cat', 'twelve:a2b', 'we:be:he:0:a:b:bother', 'fig-42:cherry:']
>>> [regex.sub(r'\d.*\K:.*', '', e) for e in items]
['42', 'twelve:a2b', 'we:be:he:0:a:b', 'fig-42:cherry']
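# For comparison, a sketch using the standard re module, which has no \K:
# capture the portion to keep and put it back via the replacement section.
# This is an alternative, not the original solution.
import re

items = ['42:cat', 'twelve:a2b', 'we:be:he:0:a:b:bother', 'fig-42:cherry:']
trimmed = [re.sub(r'(\d.*):.*', r'\1', e) for e in items]
print(trimmed)  # ['42', 'twelve:a2b', 'we:be:he:0:a:b', 'fig-42:cherry']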
</code></pre><p><strong>4)</strong> Extract all whole words unless they are preceded by <code>:</code> or <code><=></code> or <code>----</code> or <code>#</code>.<pre><code class=language-python>>>> ip = '::very--at<=>row|in.a_b#b2c=>lion----east'
>>> regex.findall(r'(?<![:#]|<=>|-{4})\b\w+', ip)
['at', 'in', 'a_b', 'lion']
</code></pre><p><strong>5)</strong> The given input string has fields separated by the <code>:</code> character. Extract field contents only if the previous field contains a digit character.<pre><code class=language-python>>>> ip = 'vast:a2b2:ride:in:awe:b2b:3list:end'
>>> regex.findall(r'(?<=\d[^:]*:)[^:]+', ip)
['ride', '3list', 'end']
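# A non-regex cross-check (a sketch, assuming str.isdigit() is close enough
# to \d for this input): pair each field with the one before it.
# The names 'fields' and 'after_digit_field' are illustrative only.
ip = 'vast:a2b2:ride:in:awe:b2b:3list:end'
fields = ip.split(':')
after_digit_field = [cur for prev, cur in zip(fields, fields[1:])
                     if any(c.isdigit() for c in prev)]
print(after_digit_field)  # ['ride', '3list', 'end']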
</code></pre><p><strong>6)</strong> The given input strings have fields separated by the <code>:</code> character. Assume that each string has a minimum of two fields and cannot have empty fields. Extract all fields, but stop if a field with a digit character is found.<pre><code class=language-python>>>> row1 = 'vast:a2b2:ride:in:awe:b2b:3list:end'
>>> row2 = 'um:no:low:3e:s4w:seer'
>>> row3 = 'oh100:apple:banana:fig'
>>> row4 = 'Dragon:Unicorn:Wizard-Healer'
>>> pat = regex.compile(r'\G([^\d:]+)(?::|\Z)')
>>> pat.findall(row1)
['vast']
>>> pat.findall(row2)
['um', 'no', 'low']
>>> pat.findall(row3)
[]
>>> pat.findall(row4)
['Dragon', 'Unicorn', 'Wizard-Healer']
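# A plain Python cross-check of the \G based solution (a sketch, assuming
# str.isdigit() is close enough to \d here). 'fields_before_digit' is a
# hypothetical helper name, not part of the original solution.
from itertools import takewhile

def fields_before_digit(row):
    # keep taking fields until one containing a digit shows up
    no_digit = lambda field: not any(c.isdigit() for c in field)
    return list(takewhile(no_digit, row.split(':')))

print(fields_before_digit('um:no:low:3e:s4w:seer'))   # ['um', 'no', 'low']
print(fields_before_digit('oh100:apple:banana:fig'))  # []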
</code></pre><p><strong>7)</strong> For the given input strings, extract <code>if</code> followed by any number of nested parentheses. Assume that there will be only one such pattern per input string.<pre><code class=language-python>>>> ip1 = 'for (((i*3)+2)/6) if(3-(k*3+4)/12-(r+2/3)) while()'
>>> ip2 = 'if+while if(a(b)c(d(e(f)1)2)3) for(i=1)'
>>> pat = regex.compile(r'if(\((?:[^()]++|(?1))++\))')
>>> pat.search(ip1)[0]
'if(3-(k*3+4)/12-(r+2/3))'
>>> pat.search(ip2)[0]
'if(a(b)c(d(e(f)1)2)3)'
</code></pre><p><strong>8)</strong> Read about the <code>POSIX</code> flag from <a href=https://pypi.org/project/regex/>https://pypi.org/project/regex/</a>. Does the following code snippet show the correct output?<p>Yes. The leftmost longest match wins when the <code>POSIX</code> flag is used. Alternation order comes into play only when the matching portions have the same length.<pre><code class=language-python>>>> words = 'plink incoming tint winter in caution sentient'
>>> change = regex.compile(r'int|in|ion|ing|inco|inter|ink', flags=regex.POSIX)
>>> change.sub('X', words)
'plX XmX tX wX X cautX sentient'
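# For contrast, a sketch with the standard re module (no POSIX flag), where
# the earliest alternative that matches wins instead of the longest one.
# The name 'default_order' is illustrative only.
import re

words = 'plink incoming tint winter in caution sentient'
default_order = re.sub(r'int|in|ion|ing|inco|inter|ink', 'X', words)
print(default_order)  # plXk XcomXg tX wXer X cautX sentient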
</code></pre><p>For the same length cases, the usual left-to-right priority is applied for the alternations. For example:<pre><code class=language-python>>>> ip = 'tryst,fun,glyph,pity,why,group'
>>> regex.sub(r'\b\w+\b|(\b[gp]\w*y\w*\b)', r'\1', ip, flags=regex.POSIX)
',,,,,'
>>> regex.sub(r'(\b[gp]\w*y\w*\b)|\b\w+\b', r'\1', ip, flags=regex.POSIX)
',,glyph,pity,,'
</code></pre><p><strong>9)</strong> Extract all whole words for the given input strings. However, based on the user input <code>ignore</code>, do not match words if they contain any character present in the <code>ignore</code> variable.<pre><code class=language-python>>>> s1 = 'match after the last new_line character A2'
>>> s2 = 'and then you want to test'
>>> ignore = 'aty'
>>> pat = regex.compile(rf'\b[\w--[{ignore}]]+\b')
>>> pat.findall(s1)
['new_line', 'A2']
>>> pat.findall(s2)
[]
>>> ignore = 'esw'
# should be the same solution used above
>>> pat = regex.compile(rf'\b[\w--[{ignore}]]+\b')
>>> pat.findall(s1)
['match', 'A2']
>>> pat.findall(s2)
['and', 'you', 'to']
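# A sketch of a similar idea with the standard re module, which lacks set
# operations: a negative lookahead rejects the ignored characters one at a
# time. As in the solution above, 'ignore' is inserted unescaped, so
# characters like ']' or '^' or '\' would need extra care.
import re

s1 = 'match after the last new_line character A2'
ignore = 'aty'
pat = re.compile(rf'\b(?:(?![{ignore}])\w)+\b')
print(pat.findall(s1))  # ['new_line', 'A2']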
</code></pre><p><strong>10)</strong> Retain only the punctuation characters for the given strings (generated from codepoints). Consider the characters defined by the Unicode set <code>\p{P}</code> as punctuations for this exercise.<pre><code class=language-python>>>> s1 = ''.join(chr(c) for c in range(0, 0x80))
>>> s2 = ''.join(chr(c) for c in range(0x80, 0x100))
>>> s3 = ''.join(chr(c) for c in range(0x2600, 0x27ec))
# r'\p{^P}+' can also be used
>>> pat = regex.compile(r'\P{P}+')
>>> pat.sub('', s1)
'!"#%&\'()*,-./:;?@[\\]_{}'
>>> pat.sub('', s2)
'¡§«¶·»¿'
>>> pat.sub('', s3)
'❨❩❪❫❬❭❮❯❰❱❲❳❴❵⟅⟆⟦⟧⟨⟩⟪⟫'
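# A standard library cross-check for the ASCII range (a sketch, assuming that
# \p{P} corresponds to the Unicode general categories starting with 'P').
# The name 'only_punct' is illustrative only.
import unicodedata

s1 = ''.join(chr(c) for c in range(0, 0x80))
only_punct = ''.join(c for c in s1 if unicodedata.category(c).startswith('P'))
print(only_punct)  # !"#%&'()*,-./:;?@[\]_{}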
</code></pre><p><strong>11)</strong> For the given <strong>markdown</strong> file, replace all occurrences of the string <code>python</code> (irrespective of case) with the string <code>Python</code>. However, any match within code blocks that starts with the whole line <code>```python</code> and ends with the whole line <code>```</code> shouldn't be replaced. Consider the input file to be small enough to fit memory requirements.<p>Refer to the <a href=https://github.com/learnbyexample/py_regular_expressions/tree/master/exercises>exercises folder</a> for the files <code>sample.md</code> and <code>expected.md</code> required to solve this exercise.<pre><code class=language-python>>>> ip_str = open('sample.md', 'r').read()
>>> pat = regex.compile(r'(?ms)^```python$.*?^```$(*SKIP)(*F)|(?i:python)')
>>> with open('sample_mod.md', 'w') as op_file: