How can I parse the emotes out of a Twitch IRC response into a list of dictionaries?

python python-3.x parsing python-3.5 twitch

657 views

2 answers

Author reputation: 1606

I would like to parse an IRC message from Twitch to a list of dictionaries, accounting for emotes.

Here is a sample of what I can get from Twitch:

"Testing. :) Confirmed!"

{"emotes": [(1, (9, 10))]}

This says that the emote with ID 1 spans characters 9 through 10, inclusive (the string is zero-indexed).
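To make the indexing concrete, here is a minimal sketch using the sample above. Both positions are inclusive, so a Python slice needs `end + 1`. The variable names are only illustrative.

```python
text = "Testing. :) Confirmed!"
emote_id, (start, end) = (1, (9, 10))

# The positions are inclusive on both ends, while Python slices
# exclude the stop index, so add 1 to the end position.
emote_text = text[start:end + 1]
print(emote_id, repr(emote_text))
```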

I would like to have my data in the following format:

[
    {
        "type": "text",
        "text": "Testing. "
    },
    {
        "type": "emote",
        "text": ":)",
        "id": 1
    },
    {
        "type": "text",
        "text": " Confirmed!"
    }
]

Is there a relatively clean way to accomplish this?

Author: 2Cubed Posted: 14.07.2016 07:30

Answers (2)


0 upvotes

Author reputation: 1606

I have found a solution which, although quite ugly, works.

import re

# The dots in the host name are escaped so they only match literal dots.
packet_expression = re.compile(r'(@.+)? :([a-zA-Z0-9][\w]{2,23})!\2@\2\.tmi\.twitch\.tv PRIVMSG #([a-zA-Z0-9][\w]{2,23}) :(.+)')

def parse_twitch(packet):

    match = re.match(packet_expression, packet)

    items = match.group(1)[1:].split(';')
    # Split on the first '=' only, in case a tag value itself contains '='.
    tags = dict(item.split('=', 1) for item in items)

    emote_expression = re.compile(r'(\d+):((\d+-\d+,)*\d+-\d+)')
    tags["emotes"] = [
        (int(emotes[0]), (int(start), int(end)))
        for emotes in re.findall(emote_expression, tags.get("emotes", ''))
        for location in emotes[1].split(',')
        for start, end in [location.split('-')]
    ]

    message = match.group(4)
    characters = list(message)

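    offset = 0
    # The offset bookkeeping below assumes emotes are visited left to right,
    # but the tag groups them by ID, so sort by start position first.
    for emote in sorted(tags["emotes"], key=lambda emote: emote[1][0]):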
        characters[emote[1][0]-offset : emote[1][1]-offset+1] = [{
            "type": "emote",
            "text": ''.join(characters[emote[1][0]-offset : emote[1][1]-offset+1]),
            "id": emote[0]
        }]
        offset += emote[1][1] - emote[1][0]

    index = 0
    while index < len(characters):
        if isinstance(characters[index], str):
            # The bounds check avoids an IndexError when the message ends in text.
            if index + 1 < len(characters) and isinstance(characters[index+1], str):
                # Merge adjacent characters into one run of text.
                characters[index:index+2] = [characters[index] + characters[index+1]]
                continue
            characters[index] = {"type": "text", "text": characters[index]}
        index += 1

    return characters
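
The emotes tag parsing step can also be tried in isolation. This is only a sketch: `parse_emote_tag` is a hypothetical helper name, and the sample tag value (with `/` between emote IDs) is made up for illustration. It sorts the result by start position, because the offset arithmetic in the replacement loop relies on emotes being visited left to right, while the tag groups positions by emote ID.

```python
import re

# Same pattern as in the answer: "<id>:<start>-<end>[,<start>-<end>...]".
emote_expression = re.compile(r'(\d+):((\d+-\d+,)*\d+-\d+)')


def parse_emote_tag(raw):
    """Return (id, (start, end)) tuples sorted by start position."""
    emotes = [
        (int(emote_id), (int(start), int(end)))
        # findall yields one tuple per match with all three groups;
        # the third (inner repetition) group is not needed here.
        for emote_id, locations, _ in emote_expression.findall(raw)
        for start, end in (loc.split('-') for loc in locations.split(','))
    ]
    return sorted(emotes, key=lambda emote: emote[1][0])


# Hypothetical tag value: emote 25 twice, and emote 1902 at the start.
print(parse_emote_tag('25:6-10,12-16/1902:0-4'))
```

An empty tag value simply produces an empty list, since `findall` finds nothing to match.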
Author: 2Cubed Posted: 28.07.2016 12:37

2 upvotes

Author reputation: 29620

Accepted answer

I'm not sure if your incoming message looks like this:

message = '''\
"Testing. :) Confirmed!"

{"emotes": [(1, (9, 10))]}'''

Or

text = "Testing. :) Confirmed!"
meta = '{"emotes": [(1, (9, 10))]}'

I'm going to assume it's the latter, because it's easy to convert from the former to the latter. It could also be that those are the Python representations. The question wasn't very clear on this.
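
For completeness, converting from the first form to the second could look roughly like this. It's a sketch that assumes the quoted text and the metadata are separated by exactly one blank line, as in the sample.

```python
message = '''\
"Testing. :) Confirmed!"

{"emotes": [(1, (9, 10))]}'''

# Assumption: the quoted text and the metadata are separated by one blank line.
quoted_text, meta = message.split('\n\n', 1)
# Drop the surrounding double quotes without touching quotes inside the text.
text = quoted_text[1:-1]

print(text)
print(meta)
```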

A much cleaner approach is to skip the regexes and just use plain string slicing:

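import json
import pprint

text = 'Testing. :) Confirmed! :P'
print(len(text))  # 25, so the last valid index is 24
meta = '{"emotes": [(1, (9, 10)), (2, (23, 24))]}'
# JSON has no tuples, so turn the parentheses into brackets before parsing.
meta = json.loads(meta.replace('(', '[').replace(')', ']'))

results = []
cur_index = 0
# Process emotes left to right so the text between them is sliced correctly.
for emote_id, (start, end) in sorted(meta['emotes'], key=lambda e: e[1][0]):
    # Skip the text segment when an emote starts the message
    # or directly follows another emote.
    if text[cur_index:start]:
        results.append({'type': 'text', 'text': text[cur_index:start]})
    # End positions are inclusive, hence the +1.
    results.append({'type': 'emote', 'text': text[start:end + 1],
                    'id': emote_id})
    cur_index = end + 1

# Whatever follows the last emote is plain text.
if text[cur_index:]:
    results.append({'type': 'text', 'text': text[cur_index:]})

pprint.pprint(results)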
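
If the metadata really is a Python literal (an assumption, since the question doesn't say), `ast.literal_eval` from the standard library is an alternative to swapping parentheses for brackets. A minimal sketch:

```python
import ast

meta = '{"emotes": [(1, (9, 10)), (2, (23, 24))]}'
# literal_eval only evaluates Python literals (no names or function calls),
# so the tuples survive as tuples instead of becoming lists.
parsed = ast.literal_eval(meta)
print(parsed['emotes'])
```

The slicing loop works the same either way, because it only indexes into each emote entry.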

From your comment, the data comes in a custom format. I copied and pasted a couple of characters from the comment that I'm not sure actually appear in the incoming data, so I hope I got that part right. The message also contained only one type of emote, so I'm not entirely sure how multiple emote types are denoted. I'm hoping there's a separator (the code below assumes '/') rather than multiple emotes= sections; otherwise this approach needs some heavy modifications. Either way, this should handle the parsing without any regex:

from collections import namedtuple


Emote = namedtuple('Emote', ('id', 'start', 'end'))


def parse_emotes(raw):
    emotes = []
    if not raw:
        # The emotes tag may be empty when the message contains no emotes.
        return emotes
    for raw_emote in raw.split('/'):
        # Named emote_id to avoid shadowing the built-in id().
        emote_id, locations = raw_emote.split(':')
        for location in locations.split(','):
            start, end = location.split('-')
            emotes.append(Emote(id=int(emote_id), start=int(start), end=int(end)))
    return emotes

data = r'@badges=moderator/1;color=#0000FF;display-name=2Cubed;emotes=25:6-10,12-16;id=05aada01-f8c1-4b2e-a5be-2534096057b9;mod=1;room-id=82607708;subscriber=0;turbo=0;user-id=54561464;user-type=mod:2cubed!2cubed@2cubed.tmi.twitch.tv PRIVMSG #innectic :Hiya! Kappa Kappa'

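meta, msgtype, channel, message = data.split(' ', maxsplit=3)
# Drop the leading '@' and split each tag on its first '=' only.
meta = dict(tag.split('=', 1) for tag in meta.lstrip('@').split(';'))
meta['emotes'] = parse_emotes(meta['emotes'])
# Strip the leading ':' so the emote positions line up with the message text.
message = message[1:]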
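
From there, turning the parsed emotes into the requested list of dictionaries is the same slicing loop as before. Here is a sketch. `build_segments` is a hypothetical helper name, and the `Emote` tuple is redefined only so the block runs on its own. The message is assumed to already have its leading ':' removed, since the positions count from the first character of the message text.

```python
from collections import namedtuple

# Mirrors the Emote tuple defined above so this block is self-contained.
Emote = namedtuple('Emote', ('id', 'start', 'end'))


def build_segments(message, emotes):
    """Split message into text/emote dicts (hypothetical helper)."""
    segments = []
    cur_index = 0
    # Visit emotes left to right; the tag groups them by ID, not by position.
    for emote in sorted(emotes, key=lambda e: e.start):
        if message[cur_index:emote.start]:
            segments.append({'type': 'text',
                             'text': message[cur_index:emote.start]})
        # End positions are inclusive, hence the +1.
        segments.append({'type': 'emote',
                         'text': message[emote.start:emote.end + 1],
                         'id': emote.id})
        cur_index = emote.end + 1
    if message[cur_index:]:
        segments.append({'type': 'text', 'text': message[cur_index:]})
    return segments


# The sample message text, without the leading ':' from the IRC line.
message = 'Hiya! Kappa Kappa'
emotes = [Emote(id=25, start=6, end=10), Emote(id=25, start=12, end=16)]
segments = build_segments(message, emotes)
print(segments)
```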
Author: Wayne Werner Posted: 28.07.2016 01:14